Research Article

Application of Unsupervised Feature Selection in Cashmere and Wool Fiber Recognition


ABSTRACT

Suitable features are the key to identifying cashmere and wool fibers, and feature selection is an important step in classification. Existing supervised feature selection methods must consider the information between fiber features and class labels. To make up for this deficiency, we propose an unsupervised feature selection method based on k-means clustering, which overcomes the difficulty that fiber class labels are either unavailable or costly to obtain. Firstly, the normalized fiber feature subsets are clustered by the k-means algorithm to obtain the total number of clusters, and the clustering effect is evaluated by the DB Index criterion. Next, the DB value of each feature subset, the correlation between features, and the total number of clusters are used as the judgment criteria for selecting the optimal feature subset. Finally, the optimal feature subset obtained by the unsupervised feature selection algorithm is fed into a support vector machine for automatic identification and classification of the two fibers. The experimental results show that the method achieves a high recognition rate of 97.25%, verifying that the unsupervised feature selection method based on k-means clustering is effective for the recognition of cashmere and wool.


Introduction

China is a large producer and exporter of cashmere. Cashmere and wool are similar in appearance and differ little in chemical properties, but their performance and value are very different. Because cashmere production is scarce, its price is dozens of times higher than that of wool. To obtain more profit, many enterprises mix the two fibers to create blended products that are popular in the current market. Therefore, to safeguard the legitimate rights and interests of consumers, accurate and rapid identification of cashmere and wool fibers is particularly important (Junli et al. 2021).

In recent years, cashmere and wool fiber identification methods have fallen into five main categories: physical, chemical, biological, image-based, and deep convolutional network methods. Among the first three categories, common techniques include DNA detection (Zhang et al. 2023), Fourier Transform Infrared Spectroscopy (FTIR) (Xu et al. 2022), Differential Scanning Calorimetry (DSC) (Anceschi et al. 2022), and thermal cleavage technology (Haque et al. 2022). Their disadvantages are complex operation, poor stability of detection results, and low accuracy. Deep convolutional networks are time-consuming, requiring a large number of samples and expensive experimental equipment. The image-based method, on the other hand, is widely applied to the identification of similar animal fibers because of its low experimental error and high accuracy.

The main task of cashmere and wool fiber image recognition is to classify samples into the appropriate categories using features extracted from the sample images. Feature fusion (Dai et al. 2021) is one of the key technologies in cashmere and wool recognition. Sun (2021) fused multi-class texture features into multi-dimensional feature matrix vectors and successfully classified cashmere and wool with the LC-KSVD algorithm at a recognition rate of up to 90%. Zhu et al. (2022) used the concept of maximum inter-class variance to linearly fuse the approximate and detailed fiber features extracted by the discrete wavelet transform, reaching a recognition accuracy of 95.20% with a support vector machine (SVM) classifier. Lu et al. (2019) introduced a technique relying on fast robust feature recognition of fiber images, in which each fiber is represented by a SURF feature histogram; a recognition model built on these features distinguished cashmere and wool fibers with a recognition rate of 93%. Xing et al. (2019) used the binary fiber image obtained after preprocessing to calculate its box-counting and information dimensions with a fractal algorithm, measured fiber fineness from the contour image with a parallel-lines algorithm, and finally clustered the extracted feature set with the k-means algorithm, achieving a recognition accuracy of 97.47%. Zhu et al. (2021) proposed a texture feature analysis method based on a gray-level co-occurrence matrix and the Gabor wavelet transform; by introducing weights, the spatial- and frequency-domain feature vectors of the fiber texture are summed linearly to achieve feature fusion, with a recognition accuracy of 93.33%. These feature fusion methods combine features extracted from fiber images into a single feature with more discriminative power than the inputs, and can also combine features from different layers or branches at different scales to improve classification accuracy.

To improve the recognition rate, it is often desirable to extract as much feature information as possible. However, this may increase the feature dimensionality and introduce weak-correlation and redundancy problems. Proper feature selection (Li et al. 2017) therefore plays a key role in accurate cashmere and wool recognition, training time, and classifier performance. Among cashmere and wool recognition methods based on feature extraction and selection, Zhong et al. (2017) converted microscopic images captured by a charge-coupled device (CCD) camera into projection curves and extracted numerical feature parameters by three different methods (recursive quantification analysis, direct geometric description, and the discrete wavelet transform), using SVM as a supervised classifier with an accuracy of 98.9%. Zhu et al. (2022) used the gray-level co-occurrence matrix algorithm to extract fiber texture, constructed an original high-dimensional feature dataset, and applied a feature selection algorithm combining correlation analysis and principal component analysis, bringing the classification accuracy to about 90%. Xing et al. (2020) proposed a fiber recognition method based on morphological feature extraction and statistical hypothesis testing of three morphological features (fiber scale height, fiber diameter, and their ratio), with a recognition accuracy of 94.2%. Zhu et al. (2023) proposed a recognition method based on the gray-level co-occurrence matrix (GLCM) and the chi-square test for texture feature selection; the chi-square test finds the most discriminative 5-dimensional features among the extracted features, which are then input into SVM for classification, obtaining a recognition accuracy of 94.39%. Zhu et al. (2022) proposed an optimal parameter selection method based on the fusion of morphological and texture features; from the relationship between the quadratic statistics of the fiber texture features and the pixel spacing, the influencing factors of the GLCM were found, and the optimal five-dimensional feature vectors were input into a Fisher classifier for classification and identification with an accuracy of 96.71%. These feature selection methods can filter the most representative features from the raw fiber features and reduce the effects of noise and redundancy, which greatly improves recognition accuracy and model performance.

Deep learning models (Matsuo et al. 2022) are increasingly used in cashmere and wool recognition; improving the network structure, loss function, and other components can also yield better recognition results. Wang and Jin (2018) proposed a Fast RCNN-based method for cashmere and wool fiber recognition: a sigmoid classifier first produces a rough classification result and the initial weights of the model, features are extracted with the Fast RCNN method and the overall features are augmented with partial features, and the network from the first round of coarse classification is used for cashmere and wool image classification with an accuracy of 95.2%. Zang et al. (2022) proposed a fiber recognition algorithm based on multiscale geometric analysis and deep convolutional neural networks (CNNs); their approach reduces the dimensionality of wool and cashmere fiber images and eliminates redundant data through multiscale geometric analysis, and the deep CNNs achieve a recognition accuracy of 96.67%. To address insufficient features and overfitting in network training, Zhu et al. (2022) proposed an improved Xception-based method for cashmere and wool fiber image recognition, with a recognition accuracy as high as 98.95%, at least 2% higher than that of the original Xception network, proving the effectiveness of their improvements.

Although these methods can effectively identify cashmere and wool fibers, their treatment of the extracted features is relatively simple. Moreover, most of them use supervised feature selection algorithms, which inevitably rely on the information between features and class labels and require a labeled dataset to identify and select the relevant features. They fail to fully consider the information within and between the features themselves, which leads to inaccurate classification and low recognition rates. Therefore, to make up for this deficiency, this paper applies an unsupervised feature selection algorithm to the cashmere and wool recognition process.

In this paper, we propose an unsupervised feature selection algorithm based on k-means clustering, built on the influence of features on classification results and on correlation analysis between features. Distance and similarity from cluster analysis serve as the core of the feature selection algorithm, reducing redundancy and thereby achieving dimensionality reduction. The algorithm preserves the original structure and information of the sample data as much as possible. It is suitable for unsupervised data and overcomes the difficulty that class labels cannot be obtained or are costly to obtain. The basic idea is to use the k-means clustering algorithm on each feature subset to determine its optimal number of clusters, then set a judgment function based on the DB Index criterion for feature selection, and finally remove the more strongly correlated feature of each selected pair from the feature subset.

An overview of the identification processes for cashmere and wool

In this paper, we address the challenge of distinguishing between cashmere and wool fibers, which is complicated by variations in growth environment, breeding methods, and mating. To overcome this problem, we propose a method that combines unsupervised feature selection with fiber recognition, reducing the need for extensive sample labeling in the initial stages and enhancing recognition efficiency. Using image processing techniques, the method extracts both texture and morphological features from fiber images, and the best subset of features is selected based on the k-means clustering algorithm, enabling more accurate discrimination between cashmere and wool fibers. The method follows a four-step process, as illustrated in Figure 1. Firstly, the acquired fiber source image is preprocessed to eliminate the background. Then the target image is used for feature extraction: a gray-level co-occurrence matrix extracts the fiber's texture features, and the geometric morphological features of the fiber are computed by the chain-code tracking algorithm and the segmentation measurement method. Next, the extracted fiber feature data, which have different magnitudes, are normalized with the min-max method, and feature selection is performed on each feature subset with the k-means clustering algorithm to acquire the optimal feature subset. Finally, SVM is used for classification to achieve the highest recognition accuracy.

Figure 1. The flow chart of the research method of this paper.


Methods

Standardization of cashmere and wool feature data

Since the cashmere and wool features extracted in this paper include both texture and morphological features, the individual feature attributes may differ in nature, size, order of magnitude, and availability, so the features and patterns of the research object cannot be analyzed directly. To ensure reliable results, the raw feature data must be normalized so that every feature lies in the same range, which makes the data indicators comparable and suitable for the subsequent unsupervised feature selection. Commonly used normalization methods are min-max (Patro and Sahu 2015) and Z-score normalization (Fei et al. 2021). The min-max approach is chosen here because the cashmere and wool feature dataset does not follow a Gaussian distribution. Min-max is a linear transformation that maps the original data to the range [0, 1]; the transformation function is shown in formula (1).

(1) $X_{norm} = \dfrac{X - X_{min}}{X_{max} - X_{min}}$

where $X_{max}$ is the maximum value of the sample data, $X_{min}$ is the minimum value, and $X_{norm}$ is the normalized value of sample $X$.
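As a concrete illustration, the following is a minimal Python sketch of formula (1), assuming the feature data are arranged as a matrix with one row per sample and one column per feature; the guard against constant columns is our own addition, not part of the paper.

```python
import numpy as np

def min_max_normalize(X):
    """Map every feature column of X linearly onto [0, 1], per formula (1)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Guard against constant columns to avoid division by zero (our assumption).
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span
```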

Determination of clustering number of feature subset

The k-means clustering algorithm is used to determine the number of clusters for each subset of cashmere and wool features. The DB Index criterion, introduced by Breaban and Luchian (2011), combines inter-class distance and intra-class dispersion to evaluate the effectiveness of clustering, and is used as the clustering evaluation metric in this study. According to the DB Index criterion, a smaller value of $DB_k$ indicates a better classification effect. The fundamental components of the DB Index criterion are as follows:

Step 1: Average intra-class dispersion is shown in formula (2).

(2) $S_i = \dfrac{1}{|C_i|} \sum_{X \in C_i} \left\| X - Z_i \right\|$

where $Z_i$ refers to the centroid of class $C_i$, and $|C_i|$ denotes the number of samples in that class.

Step 2: The interclass distance is expressed as the distance between two class centers, as shown in formula (3).

(3) $d_{ij} = \left\| Z_i - Z_j \right\|$

Step 3: DB Index is shown in formula (4).

(4) $DB_k = \dfrac{1}{k} \sum_{i=1}^{k} R_i, \qquad R_i = \max_{j=1,\dots,k,\; j \neq i} \dfrac{S_i + S_j}{d_{ij}}$

Where k is the number of categories.
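The following is a direct Python transcription of formulas (2)-(4), assuming integer cluster labels 0…k-1 and non-empty clusters; scikit-learn's sklearn.metrics.davies_bouldin_score implements the same criterion and can serve as a cross-check.

```python
import numpy as np

def db_index(X, labels, centers):
    """Davies-Bouldin index DB_k per formulas (2)-(4)."""
    k = len(centers)
    # Formula (2): average intra-class dispersion S_i around centroid Z_i.
    S = np.array([np.mean(np.linalg.norm(X[labels == i] - centers[i], axis=1))
                  for i in range(k)])
    total = 0.0
    for i in range(k):
        # Formula (3): d_ij = ||Z_i - Z_j||; formula (4): R_i = max_{j != i} (S_i + S_j) / d_ij.
        total += max((S[i] + S[j]) / np.linalg.norm(centers[i] - centers[j])
                     for j in range(k) if j != i)
    return total / k  # formula (4): DB_k = (1/k) * sum_i R_i
```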

In this paper, an iterative approach is employed to cluster the cashmere and wool feature dataset without any prior information about the sample distribution. The goal is to determine the optimal number of clusters, which will not exceed the maximum number of clusters; here n represents the total number of features in each cashmere and wool feature subset. For the clustering experiments, 500 feature subsets each of cashmere and wool are selected. The iterative search for the number of clusters is performed over the range 2 to 100; to limit the number of iterations, a value much smaller than n is set according to the actual properties of the cashmere and wool feature subsets, with the maximum number of iterations set to 10. Let each feature subset of cashmere and wool be $x_i$ and its corresponding number of clusters be $k_i$; the procedure for determining $k_i$ is as follows.

Firstly, the settings are initialized. Under the DB Index criterion, let C denote the iteration variable for the cluster count: C is initially set to 2, $k_i$ to 1, and $DB^*$, which stores the minimum DB value found so far, to an initially large value. Secondly, the k-means clustering algorithm is used to cluster the sample features: a feature subset of cashmere and wool is randomly selected and the cluster centers are initialized from the data. The selection of the initial cluster centers is the most crucial aspect of the k-means algorithm, since their locations significantly affect the convergence and quality of the resulting clusters, so careful consideration must be given to their placement. Inspired by the success of k-means++, this paper adopts the idea of the k-means++ clustering algorithm (Kapoor and Singhal 2017) and combines it with an analysis of the cashmere and wool feature dataset to improve how the original k-means clustering algorithm determines the initial cluster center locations.

To determine the initial cluster centers, we first compute the uniform distribution function over a subset of cashmere and wool features. A feature value is then randomly chosen under this uniform distribution as the first initial cluster center. Next, the shortest distance d between each feature and the current cluster centers is computed, and the probability P of each feature being selected as the next cluster center is calculated according to formula (5); the feature value with the highest probability is chosen as the next center. These steps are repeated until k initial cluster centers have been selected. After that, the Euclidean distance (S. P. Patel and Upadhyay 2020) between each feature and the k initial cluster centers is calculated according to formula (6), and each feature is assigned to the class with the closest center. This operation is repeated, recomputing each cluster center after all feature points have been assigned, until the judgment function in formula (7) satisfies $d_j(i) \le \alpha$ (where α is a preset threshold) and the clustering ends.

If the current value $DB_C$ is less than $DB^*$, assign $DB_C$ to $DB^*$ and set $k_i = C$. Then increase C by 1 each time; the clustering operation continues only while C does not exceed the maximum of 10 iterations, and otherwise terminates. The resulting $k_i$ is the optimal number of clusters for this feature subset.

(5) $P = \dfrac{d^2}{\sum d^2}$
(6) $dist(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
(7) $d_j(i) = \dfrac{\left| DB_C(i+1) - DB_C(i) \right|}{DB_C(i)}$

where $DB_C(i)$ refers to the DB value obtained at the i-th iteration when there are C clusters.
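Putting the pieces together, the sketch below shows one plausible reading of the cluster-count search, reusing db_index from the earlier sketch; scikit-learn's built-in init="k-means++" stands in for the center-selection rule of formulas (5)-(6), and the relative-change stopping rule approximates formula (7). The parameter names are ours.

```python
from sklearn.cluster import KMeans

def optimal_cluster_count(X, c_max=10, alpha=1e-4):
    """Iterate C = 2..c_max and keep the C with the smallest DB value (DB*)."""
    best_db, best_k, prev_db = float("inf"), 1, None
    for c in range(2, c_max + 1):
        # k-means++ initialization plays the role of formulas (5)-(6).
        km = KMeans(n_clusters=c, init="k-means++", n_init=10).fit(X)
        db = db_index(X, km.labels_, km.cluster_centers_)
        if db < best_db:               # update DB* and k_i when DB_C improves
            best_db, best_k = db, c
        # Stopping rule in the spirit of formula (7): halt once DB stabilizes.
        if prev_db is not None and abs(db - prev_db) / prev_db <= alpha:
            break
        prev_db = db
    return best_k, best_db
```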

Judgment rules for selecting feature subsets

The features corresponding to two cashmere and wool feature subsets $x_i$, $x_j$ (i = 1…t, j = 1…t, i ≠ j, where t is the number of feature subsets) are not exactly the same, so the $DB_k$ values obtained for different feature subsets $x_i$, $x_j$ are not directly comparable, and the judgment rule therefore needs to be normalized. Assuming that the clustering result corresponding to feature subset $x_i$ is $k_i$, the corresponding DB value $DB_{k_i}$ is obtained according to the DB Index criterion, and the feature subset is then selected according to the judgment function shown in formula (8): the $x_i$ that minimizes formula (8) is chosen.

(8) $normalized\_crit(x_i) = \dfrac{DB_{k_i}}{\frac{1}{t} \sum_{p=1}^{t} DB_{k_p}}$
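Under our reading of formula (8), each subset's DB value is divided by the mean DB value over all t subsets so that the criteria become comparable; a tiny sketch, on that assumption:

```python
def normalized_crit(db_values):
    """Normalize each subset's DB value by the mean over all t subsets
    (one plausible reading of formula (8))."""
    mean_db = sum(db_values) / len(db_values)
    return [v / mean_db for v in db_values]
```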

Unsupervised feature selection algorithm based on k-means clustering

Feature selection, as a dimensionality reduction method, can effectively remove redundant features that are irrelevant to the classification task while retaining a small number of key features. It both reduces the computational complexity of data classification or clustering and improves the accuracy of machine learning algorithms. The unsupervised feature selection algorithm based on k-means clustering proposed in this paper first computes the number of clusters for each feature subset through multiple iterations. The clustering effect of each feature subset is then judged directly by the DB Index criterion. Finally, the feature subsets are selected according to the judgment rule function. The whole process considers only the information content of each feature subset itself and does not require class label information. For the finally selected feature subsets, the correlation between different subsets is calculated; the less correlated features are retained and one feature of each highly correlated pair is deleted. In this way the best feature subset is selected and the number of features is reduced. The entire unsupervised feature selection workflow is shown in Figure 2.

Figure 2. The flow chart of unsupervised feature selection algorithm based on k-means clustering.


Let S represent the selected feature subset, which is initially empty, and let F denote the original feature set, which initially contains all features. In this paper, a heuristic search (He et al. 2019) is used to generate feature subsets. The whole process of the heuristic-search unsupervised feature selection algorithm based on k-means clustering is as follows: (1) Select the first feature $x_{i1}$. Each time, one feature $x_j$ is selected from the original feature set, k-means cluster analysis is carried out on the feature $x_j$, and $DB_k(x_j)$ is calculated for the clustering result; the feature $x_{i1}$ is selected to satisfy formula (9).

(9) $x_{i1} = \arg\min_{x_i} \big( normalized\_crit(x_i) \big), \quad x_i \in \{ x_j \mid clusterTotal(x_j) > \varepsilon,\ x_j \in F \}; \qquad S = \{ x_{i1} \},\ F = F - \{ x_{i1} \}$

Where clusterTotal(xj) denotes the total number of clusters obtained by selecting the j-th feature xj for cluster analysis, and normalizedcrit(xi) denotes the normalized DBk value obtained by selecting the i-th feature xi for cluster analysis. When a feature is added, if the total number of clusters obtained from the clustering result exceeds a dynamically set threshold value and the DBk(xi) is smaller, it indicates that the added feature xi is effective for clustering. In other words, it is considered an important feature.

(2) Select the second feature $x_{i2}$. A feature $x_j$ is selected from the original feature set F, the data corresponding to the feature combination $\{x_j\} \cup S$ is clustered, and the corresponding $DB_k(x_j)$ value is calculated for the clustering result. The selected feature $x_{i2}$ satisfies formula (10), where $\rho_{x_i x_{i1}}$ denotes the correlation coefficient between features $x_i$ and $x_{i1}$.

(10) $x_{i2} = \arg\min_{x_i} \big( normalized\_crit(x_i) + \rho_{x_i x_{i1}} \big)$, where $x_i \neq x_{i1}$ and $x_i \in \{ x_j \mid clusterTotal(x_j) > \varepsilon,\ x_j \in F \}$; $S = S \cup \{ x_{i2} \},\ F = F - \{ x_{i2} \}$

(3) Select the p-th feature $x_{ip}$, p = 3,…,N. A feature $x_j$ is selected from the original feature set F in turn, the data corresponding to the feature combination $\{x_j\} \cup S$ is clustered, the corresponding $DB_k(x_j)$ values are calculated for the clustering results, and the feature $x_{ip}$ is selected to satisfy formula (11).

(11) $x_{ip} = \arg\min_{x_i} \Big( normalized\_crit(x_i) + \frac{1}{p-1} \sum_{r=1}^{p-1} \rho_{x_i x_{ir}} \Big)$, where $x_i \notin \{ x_{i1}, \dots, x_{i(p-1)} \}$ and $x_i \in \{ x_j \mid clusterTotal(x_j) > \varepsilon,\ x_j \in F \}$; $S = S \cup \{ x_{ip} \},\ F = F - \{ x_{ip} \}$

From Equations (10) and (11), it can be seen that the correlation between features is also taken into account when selecting the i-th (i = 2,…,N) feature.

Step (3) is repeatedly executed until m features are selected. If m = N, where N is the total number of original features, then all the features are sorted.
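The sketch below assembles steps (1)-(3) into one greedy loop, reusing optimal_cluster_count from the earlier sketch. The threshold eps (standing in for ε, which the paper sets dynamically), the correlation weight rho, and the use of the absolute Pearson coefficient are our assumptions.

```python
import numpy as np

def select_features(X, m, eps=2, rho=1.0, c_max=10):
    """Greedy unsupervised selection per formulas (9)-(11): prefer features whose
    clusters are compact (small normalized DB) and weakly correlated with S."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < m and remaining:
        db_of = {}
        for j in remaining:
            cols = selected + [j]                      # cluster {x_j} U S
            k_j, db_j = optimal_cluster_count(X[:, cols], c_max=c_max)
            if k_j > eps:                              # clusterTotal(x_j) > eps
                db_of[j] = db_j
        if not db_of:
            break
        mean_db = np.mean(list(db_of.values()))        # normalization, cf. formula (8)
        best_j, best_score = None, float("inf")
        for j, db_j in db_of.items():
            score = db_j / mean_db
            if selected:                               # correlation penalty, formulas (10)-(11)
                score += rho * np.mean([abs(np.corrcoef(X[:, j], X[:, r])[0, 1])
                                        for r in selected])
            if score < best_score:
                best_j, best_score = j, score
        selected.append(best_j)                        # S = S U {x_ip}, F = F - {x_ip}
        remaining.remove(best_j)
    return selected
```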

Classification

In this paper, we compared the performance of the SVM (Pisner and Schnyer 2020), Naive Bayes (Chen et al. 2020), and K-Nearest Neighbors (KNN) (Taunk et al. 2019) classification algorithms. These are commonly used supervised learning techniques for handling different types of data. SVM is a widely employed generalized linear classifier that maximizes the margin between the support vectors and the decision boundary; it performs well on small datasets and is particularly suitable for nonlinear data patterns. The Naive Bayes algorithm is based on Bayes' theorem and assumes feature independence; it calculates the joint probability distribution of input and output and uses conditional probability to determine the probability of a sample belonging to a particular class, excelling in tasks such as text classification. The KNN algorithm predicts the class of a new instance by computing its distance to the k nearest neighbors in the training set and assigning the class label by majority vote; it is an intuitive, straightforward algorithm suited to small-scale datasets and low-dimensional feature spaces.
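As a sketch of how the three classifiers might be compared on the selected features (classifier settings such as the RBF kernel and k = 5 are our assumptions, not specified by the paper):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def compare_classifiers(X, y):
    """Fit SVM, Naive Bayes, and KNN on one 70/30 split and report test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    for name, clf in [("SVM", SVC(kernel="rbf")),
                      ("NaiveBayes", GaussianNB()),
                      ("KNN", KNeighborsClassifier(n_neighbors=5))]:
        print(name, clf.fit(X_tr, y_tr).score(X_te, y_te))
```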

Experimental results and analysis

Experimental result of image pre-processing

For this study, a total of 500 cashmere and wool fiber images were obtained with a scanning electron microscope at a magnification of 1000×, then captured and stored on a computer at dimensions of 275 × 275 pixels. Figure 3 displays the captured fiber images of cashmere and wool, while the outcomes of the two distinct pre-processing methods are shown in Figures 4 and 5.

Figure 3. (a) Wool and (b) cashmere.


Figure 4. Preprocessing for extracting texture features: (a) original image, (b) edge detection, (c)filling margin, (d) removal of background.


Figure 5. Preprocessing for extracting morphological characteristics: (a) original image, (b) image enhancement, (c) fiber binarization image, (d) removal of impurity.


Experimental results of fiber feature extraction

After pre-processing the single-contour fiber images, we computed a set of 56 features based on quadratic statistics of the GLCM (Garg and Dhiman 2021), calculated in four orientations, to serve as descriptors of the original samples. The 14 statistics are mean, entropy, energy, correlation, contrast, angular second-order moment, inertia, variance, cluster shading, cluster saliency, homogeneity, total entropy, difference, and difference entropy (D. R. Patel et al. 2020), each computed at angles of 0°, 45°, 90°, and 135°. Table 1 showcases the texture characteristics of cashmere. In addition, 8 geometric features of the fiber scales, namely scale area, scale perimeter, scale diameter, scale interval, scale rectangle factor, relative scale area, relative scale perimeter, and relative scale interval (diameter-to-interval ratio), were measured from the single-pixel-wide fiber binary image, as shown in Table 2.
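For illustration, a hedged sketch of GLCM texture extraction with scikit-image follows; graycoprops exposes only a subset of the 14 statistics used in the paper (the remainder would have to be computed from the co-occurrence matrix directly), and the single pixel distance of 1 is our assumption.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_img):
    """GLCM statistics in the four orientations 0, 45, 90, and 135 degrees.
    gray_img must be a 2-D uint8 image."""
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    glcm = graycomatrix(gray_img, distances=[1], angles=angles,
                        levels=256, symmetric=True, normed=True)
    feats = []
    for prop in ("contrast", "homogeneity", "energy", "correlation", "ASM"):
        feats.extend(graycoprops(glcm, prop).ravel())  # 4 values per statistic
    return np.array(feats)
```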

Table 1. The original description of the texture features extracted in different orientations.

Table 2. The initial description of morphological characteristics.

Results of fiber optimal feature selection

When the unsupervised feature selection algorithm based on k-means clustering is used for feature selection, the first step is to determine the number of clusters for each cashmere and wool feature subset. For the initial description comprising 64 texture and morphological features extracted from the original sample images, 500 samples each of cashmere and wool were selected during the experiment to form 64-dimensional feature subsets; the maximum number of iterations was set to 10, and α in the clustering termination function was set to 0.0001. The final number of cluster classes per one-dimensional feature subset is shown in Figure 6, and the calculated $DB_k$ values are shown in Table 3. The $DB_k$ values obtained were then normalized according to the judgment rules for selecting the cashmere feature subsets and arranged in descending order. Finally, the Pearson correlation coefficient formula was used for feature correlation analysis. During the experiment, the unsupervised feature selection algorithm was applied to all cashmere and wool features, and the features were ranked by importance from greatest to smallest, as shown in Table 4.

Figure 6. The final number of clusters per dimensional feature subset: (a) number of texture feature clusters, (b)Number of morphological characteristics clusters.


Table 3. The DBk value of characteristic subset.

Table 4. The rank of all significant feature subsets.

Comparison experimental results of other feature selection algorithms

To verify the effectiveness of the unsupervised feature selection algorithm based on k-means clustering, this paper compares the recognition accuracy of six supervised feature selection algorithms, namely the chi-square test, the correlation coefficient, reliefF (Pathina and Kumar 2022), the T-test (Gerald 2018), FisherScore (KP and Thiyagarajan 2022), and GiniIndex (Miao et al. 2022), together with an unsupervised feature selection algorithm based on pattern similarity judgment (Basak, De, and Pal 1998), on the cashmere and wool fiber dataset. 70% of each dataset was randomly selected for training and 30% for testing, and the class labels of the training set were removed when the unsupervised feature selection algorithms were used. Since there is no criterion tied to the optimal number of features for this sample set, the effectiveness of the algorithms is evaluated by setting the number of selected features manually. To avoid accidental results, each experiment is repeated 10 times and the average is taken. The experimental results in Figure 7 show that the unsupervised feature selection algorithm based on k-means clustering outperforms all the other algorithms on the cashmere and wool feature dataset.
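A minimal sketch of this evaluation protocol (random 70/30 splits, 10 repetitions, averaged accuracy), assuming the selected feature columns are already in X:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def repeated_accuracy(X, y, n_runs=10):
    """Average SVM test accuracy over n_runs random 70/30 splits."""
    accs = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                                  random_state=seed)
        accs.append(SVC().fit(X_tr, y_tr).score(X_te, y_te))
    return float(np.mean(accs))
```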

Figure 7. Recognition accuracy of different feature selection algorithms on cashmere and wool dataset (for SVM classifier).


Compared with the other algorithms, the recognition rate of the supervised feature selection algorithm based on the correlation coefficient is relatively low. This is because the sample size of the dataset is small, causing large fluctuations in the correlation coefficient, which can even approach 1; it is therefore inappropriate to conclude that a close linear relationship exists between the features and the classification results merely because the correlation coefficient is large. The recognition rates of supervised feature selection algorithms such as the chi-square test, reliefF, T-test, FisherScore, and GiniIndex are all around 95%. The recognition rates of the reliefF and GiniIndex algorithms decrease as the number of feature dimensions increases: reliefF assigns weights to variables according to the correlation between the features and the target variable, and this correlation becomes more complicated as the feature dimensionality grows, which lowers the recognition rate; GiniIndex tends to ignore the correlation between the features and the target variable.

The performance of the T-test and FisherScore feature selection algorithms fluctuates as the number of feature dimensions increases. The T-test requires the feature data to follow a normal distribution, but some of the cashmere and wool feature data do not satisfy this condition, which affects the stability of the algorithm. The FisherScore algorithm does not take the correlation between the cashmere and wool features into account during feature selection, which causes the recognition rate to fluctuate. The feature selection algorithm based on pattern similarity judgment shows a decrease in recognition rate as the feature dimensionality increases, similar to FisherScore. The recognition rate of the chi-square test shows an increasing trend but remains lower, because the method considers only the correlation between the features and the output values, ignoring the effect that correlation among the features has on the recognition rate.

Since accuracy and other evaluation indexes only quantify the experimental results, they cannot by themselves highlight the differences between algorithms. The T-test is one of the simplest statistical methods for assessing whether there is a statistically significant difference between two samples, and we use it here to compare the feature selection algorithms more directly. The T-value is the statistic computed during the T-test: its sign indicates the direction of the difference between the means, and its absolute value indicates significance, with larger values implying a more significant difference between the two methods. The P-value is an indicator used in statistical hypothesis testing to assess the degree to which the observed sample data support the null hypothesis. In the T-test, a significance level of 0.05 is usually used: if the P-value is less than or equal to 0.05, the null hypothesis is rejected and the result is considered statistically significant; if it is greater than 0.05, there is not enough evidence to support a conclusion different from the null hypothesis. We used the T-test to compare the unsupervised feature selection algorithm based on k-means clustering proposed in this paper with the other feature selection algorithms; the results are shown in Table 5. It is evident that our proposed algorithm differs significantly from the other feature selection algorithms. This disparity arises because the algorithms employ diverse feature selection strategies with distinct interpretations of feature importance: our algorithm relies on the clustering structure of the data, while the others use evaluation criteria such as information gain and analysis of variance. These divergent strategies account for the observed significant performance differences.
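For example, such a comparison can be run with scipy's independent two-sample T-test; the per-run accuracies below are purely illustrative placeholders, not the paper's measurements.

```python
import numpy as np
from scipy import stats

# Hypothetical per-run accuracies over 10 repetitions of two feature selectors.
acc_kmeans_fs = np.array([.972, .970, .975, .968, .974, .971, .973, .969, .976, .972])
acc_chi2_fs   = np.array([.941, .945, .938, .944, .940, .946, .939, .943, .942, .944])

t_value, p_value = stats.ttest_ind(acc_kmeans_fs, acc_chi2_fs)
print(f"T = {t_value:.2f}, p = {p_value:.4f}")  # p <= 0.05 -> significant difference
```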

Table 5. T-test results of the experiment (for cashmere and wool dataset).

We also conducted comparative experiments on other public datasets, described in Table 6, with the experimental results shown in Table 7. On the Arrhythmia dataset, the proposed algorithm has an obvious advantage. On the other two datasets, although the proposed algorithm has an advantage in accuracy, it is not pronounced. In addition, the higher the dimensionality of the dataset, the greater the gap between the algorithms. The Arrhythmia dataset and the extracted cashmere and wool feature dataset have many samples and few features, so the classification difficulty is relatively low; with the various feature selection algorithms, the recognition accuracy on these two datasets exceeds 90% and the differences are within 5%. GLRC and Colon, by contrast, are high-dimensional, small-sample datasets that are difficult to classify, and the difference in recognition accuracy between feature selection algorithms on them can exceed 15%. As can be seen, the choice of an appropriate feature selection algorithm is particularly important for small-sample, high-dimensional data.

Table 6. The publicly available dataset in the experiment.

Table 7. Recognition accuracy of different feature selection algorithms on public datasets (for SVM classifier).

In the end, any algorithm performs differently on different datasets. Of the two algorithms reliefF and FisherScore, the former has the advantage on the GLRC dataset, while the latter is more prominent on the Colon dataset. All in all, though, the performance of the unsupervised feature selection algorithm based on k-means clustering is relatively stable. Similarly, as shown in Table 8, the proposed algorithm remains relatively stable when the Bayes classifier is used. It can therefore meet the needs of practical cashmere and wool classification.

Table 8. Recognition accuracy of different feature selection algorithms on public dataset (for Bayes classifier).

Comparison experiments with existing fiber identification methods

In Table 9, different fiber mixing ratios are used to evaluate the method presented in this paper. The identification accuracy is highest when the number of cashmere and wool samples is largest and the ratio is 1:1, because more samples can adequately describe the properties of the different fibers. Under the various blending ratios of cashmere and wool, the accuracy of the proposed method is about 97% and its performance is stable.

Table 9. Precision of identification of various numbers of samples.

In addition, many scholars in the field of cashmere and wool fiber recognition have used deep learning methods. We compared the performance of the neural network models proposed by Wang et al., Zang et al., and Huang et al., as well as the standard VGG16 and ResNet18 convolutional neural network models, on the cashmere and wool fiber dataset. The experimental results are shown in Table 10. At the same time, we used the T-test to compare the differences between the models, with the results shown in Table 11. It is evident that our proposed unsupervised feature selection algorithm based on k-means clustering differs significantly from the convolutional neural network models. This disparity arises because our algorithm employs a different strategy for feature selection, relying more on the clustering structure of the data, whereas convolutional neural networks typically learn features automatically from raw data through structures such as convolutional and pooling layers, focusing on the local and global characteristics of the data. These distinct feature selection methods and differences in data representation account for the observed significant performance differences.

Table 10. Performance of convolutional neural network models on cashmere and wool dataset.

Table 11. T-test results of different models.

This paper also compares the performance of the recognition algorithms proposed by Lu et al. (2019), Xing et al. (2020), Sun (2021), Zhu et al. (2022), and Zhu et al. (2023) with the unsupervised feature selection algorithm based on k-means clustering on the cashmere and wool dataset. As shown in Figure 8, the feature selection algorithm proposed in this paper achieves a comparatively high recognition rate, demonstrating the effectiveness of the proposed feature selection approach in enhancing the accuracy of cashmere and wool recognition.

Figure 8. Comparison of different recognition methods.


Conclusion and future work

In this paper, an unsupervised feature selection method based on k-means clustering is proposed for cashmere and wool fiber recognition. The method first preprocesses the cashmere and wool fiber images with two different preprocessing pipelines, and then computes a total of 64 texture and morphological features with the GLCM algorithm, the chain-code tracking algorithm, and the segmentation measurement method. Next, the k-means clustering algorithm is used to calculate the number of clusters and the corresponding $DB_k$ value of each feature subset, which are arranged in descending order. Finally, feature selection is performed according to the judgment rule and the relevance of each feature subset, and the selected optimal feature subset is trained and classified with SVM. In the experiments, to verify the effectiveness of the unsupervised feature selection algorithm based on k-means clustering, its recognition accuracy is compared with that of other supervised and unsupervised feature selection algorithms on the cashmere and wool feature dataset and on other public datasets. The algorithm is also compared with existing deep learning and other recognition algorithms in the cashmere and wool field, and the significance of the differences is assessed with the T-test. The results show that the proposed method achieves a higher recognition rate and differs significantly from the other algorithms.

In the planning of future work, we will explore diverse clustering algorithms and try different distance metrics to more fully capture the similarity between samples. At the same time, we aim to expand the feature set by delving into deeper-level characteristics of cashmere and wool fibers, not confining ourselves solely to the texture and morphological features from an expert perspective. Furthermore, we plan to augment the sample dataset, incorporating samples of different modified fibers, to thoroughly evaluate the algorithm’s performance in real-world applications. Through these measures, we anticipate elevating the algorithm’s comprehensiveness and applicability, enabling it to better address the diversity of practical problems.

Highlights

  • The unsupervised feature selection algorithm is applied to the process of cashmere and wool fiber recognition, overcoming the difficulty that class labels cannot be obtained or are costly to obtain.

  • We propose an unsupervised feature selection algorithm based on k-means clustering, built on the influence of features on classification results and on correlation analysis among features.

  • The k-means clustering algorithm is used to determine the optimal number of clusters for each cashmere and wool feature subset, and a judgment function is then set based on the DB Index criterion for feature selection. Finally, the more strongly correlated feature of each pair is deleted from the selected feature subset to achieve the purpose of feature selection.

  • Cluster analysis based on distance and similarity is used as the core of feature selection algorithm to reduce redundancy and achieve the purpose of dimension reduction.

  • The proposed algorithm preserves as much as possible the original structure and information of the sample data. It is also applicable to other unsupervised data.

Credit contribution statement

Yaolin Zhu: Conceptualization; Data curation; Formal analysis; Funding acquisition; Investigation; Methodology; Project administration; Resources; Writing-original draft; Writing-review & editing. Xingze Wang: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Project administration; Resources; Software; Supervision; Validation; Visualization; Writing-original draft; Writing-review & editing. Meihua Gu: Resources; Validation. Gang Hu: Formal analysis; Resources; Supervision; Validation. Wenya Li: Resources; Supervision.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Additional information

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Natural Science Basic Research Key Program funded by the Shaanxi Provincial Science and Technology Department (No. 2022JZ-35 and No. 2023-JC-ZD-33), the Key Research Program of the Industrial Textiles Collaborative Innovation Center Project of the Shaanxi Provincial Department of Education (No. 20JY026), and the Science and Technology Plan Project of Yulin City (No. CXY-2020-052).

References

  • Anceschi, A., M. Zoccola, R. Mossotti, P. Bhavsar, G. Dalla Fontana, and A. Patrucco. 2022. “Colorimetric Quantification of Virgin and Recycled Cashmere Fibers: Equilibrium, Kinetic, and Thermodynamic Studies.” Journal of Natural Fibers 19 (15): 11064–20. https://doi.org/10.1080/15440478.2021.2009399.
  • Basak, J., R. K. De, and S. K. Pal. 1998. “Unsupervised feature selection using a neuro-fuzzy approach.” Pattern Recognition Letters 19 (11): 997–1006. https://doi.org/10.1016/S0167-8655(98)00083-X.
  • Breaban, M., and H. Luchian. 2011. “A Unifying Criterion for Unsupervised Clustering and Feature Selection.” Pattern Recognition 44 (4): 854–865. https://doi.org/10.1016/j.patcog.2010.10.006.
  • Chen, S., G. I. Webb, L. Liu, and X. Ma. 2020. “A Novel Selective Naïve Bayes Algorithm.” Knowledge-Based Systems 192:105361. https://doi.org/10.1016/j.knosys.2019.105361.
  • Dai, Y., F. Gieseke, S. Oehmcke, Y. Wu, K. Barnard. 2021. Attentional Feature Fusion. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 3560–3569. https://doi.org/10.1109/wacv48630.2021.00360
  • Fei, N., Y. Gao, Z. Lu, T. Xiang. 2021. Z-Score Normalization, Hubness, and Few-Shot Learning. Proceedings of the IEEE/CVF International Conference on Computer Vision, 142–151. https://doi.org/10.1109/iccv48922.2021.00021
  • Garg, M., and G. Dhiman. 2021. “A Novel Content-Based Image Retrieval Approach for Classification Using GLCM Features and Texture Fused LBP Variants.” Neural Computing and Applications 33 (4): 1311–1328. https://doi.org/10.1007/s00521-020-05017-z.
  • Gerald, B. 2018. “A brief review of independent, dependent and one sample t-test.” International Journal of Applied Mathematics and Theoretical Physics 4 (2): 50–54. https://doi.org/10.11648/j.ijamtp.20180402.13.
  • Haque, A. N. M. A., M. Naebe, D. Mielewski, A. Kiziltas. 2022. “Thermally Stable Micro-Sized Silica-Modified Wool Powder from One-Step Alkaline Treatment.” Powder Technology 404. https://doi.org/10.1016/j.powtec.2022.117517.
  • He, B., S. Shah, C. Maung, G. Arnold, G. Wan, and H. Schweitzer. 2019. “Heuristic Search Algorithm for Dimensionality Reduction Optimally Combining Feature Selection and Feature Extraction.” Proceedings of the AAAI Conference on Artificial Intelligence 33 (1): 2280–2287. https://doi.org/10.1609/aaai.v33i01.33012280.
  • Junli, L., L. Kai, Z. Boping, Z. Yang, C. Yonggang, and T. Jinsui. 2021. “Current Status and Progress of Identification Methods for Cashmere and Wool Fibers.” Wool Textile Journal 49:49. https://doi.org/10.19333/j.mfkj.20210104606.
  • Kapoor, A., A. Singhal. 2017. A Comparative Study of K-Means, K-Means++ and Fuzzy C-Means Clustering Algorithms. 2017 3rd international conference on computational intelligence & communication technology (CICT), IEEE, 1–6. https://doi.org/10.1109/ciact.2017.7977272
  • KP, M. N., and P. Thiyagarajan. 2022. “Feature Selection Using Efficient Fusion of Fisher Score and Greedy Searching for Alzheimer’s Classification.” Journal of King Saud University-Computer & Information Sciences 34:4993–5006. https://doi.org/10.1016/j.jksuci.2020.12.009.
  • Li, J., K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu. 2017. “Feature selection: A data perspective.” ACM Computing Surveys (CSUR) 50 (6): 1–45. https://doi.org/10.1145/3136625.
  • Lu, K., J. Luo, Y. Zhong, and X. Chai. 2019. “Identification of Wool and Cashmere SEM Images Based on SURF Features.” Journal of Engineered Fibers and Fabrics 14:1558925019866121. https://doi.org/10.1177/1558925019866121.
  • Matsuo, Y., Y. LeCun, M. Sahani, D. Precup, D. Silver, M. Sugiyama, E. Uchibe, et al. 2022. “Deep Learning, Reinforcement Learning, and World Models.” Neural Networks 152:267–275. https://doi.org/10.1016/j.neunet.2022.03.037.
  • Miao, Y., J. Wang, B. Zhang, and H. Li. 2022. “Practical Framework of Gini Index in the Application of Machinery Fault Feature Extraction.” Mechanical Systems and Signal Processing 165:108333. https://doi.org/10.1016/j.ymssp.2021.108333.
  • Patel, D. R., H. Thakker, M. Kiran, and V. Vakharia. 2020. “Surface Roughness Prediction of Machined Components Using Gray Level Co-Occurrence Matrix and Bagging Tree.” FME Transactions 48 (2): 468–475. https://doi.org/10.5937/fme2002468P.
  • Patel, S. P., and S. H. Upadhyay. 2020. “Euclidean Distance Based Feature Ranking and Subset Selection for Bearing Fault Diagnosis.” Expert Systems with Applications 154:113400. https://doi.org/10.1016/j.eswa.2020.113400.
  • Pathina, J. M. R., P. R. Kumar. 2022. Classification of Non-Fluctuating Radar Target Using ReliefF Feature Selection Algorithm, Evolution in Signal Processing and Telecommunication Networks: Proceedings of Sixth International Conference on Microelectronics, Electromagnetics and Telecommunications (ICMEET 2021), 2, Springer, https://doi.org/10.1007/978-981-16-8554-5_18
  • Patro, S., and K. K. Sahu. 2015. “Normalization: A preprocessing stage.” IARJSET 20–22. arXiv preprint arXiv:1503.06462. https://doi.org/10.17148/IARJSET.2015.2305.
  • Pisner, D. A., and D. M. Schnyer. 2020. SVM, Machine Learning 101–121. Elsevier. https://doi.org/10.1016/b978-0-12-815739-8.00006-7.
  • Sun, C. 2021. Image Classification of Cashmere and Wool Fiber Based on LC-KSVD. Journal of Physics: Conference Series, IOP Publishing, 12010. https://doi.org/10.1088/1742-6596/1948/1/012010
  • Taunk, K., S. De, S. Verma, A. Swetapadma. 2019. A Brief Review of Nearest Neighbor Algorithm for Learning and Classification. 2019 International Conference on Intelligent Computing and Control Systems (ICCS), IEEE, 1255–1260. https://doi.org/10.1109/iccs45141.2019.9065747
  • Wang, F., and X. Jin. 2018. “The Application of Mixed-Level Model in Convolutional Neural Networks for Cashmere and Wool Identification.” International Journal of Clothing Science and Technology 30 (5): 710–725. https://doi.org/10.1108/IJCST-11-2017-0171.
  • Xing, W., Y. Liu, N. Deng, B. Xin, W. Wang, and Y. Chen. 2020. “Automatic Identification of Cashmere and Wool Fibers Based on the Morphological Features Analysis.” Micron 128:102768. https://doi.org/10.1016/j.micron.2019.102768.
  • Xing, W., B. Xin, N. Deng, Y. Chen, and Z. Zhang. 2019. “A Novel Digital Analysis Method for Measuring and Identifying of Wool and Cashmere Fibers.” Measurement 132:11–21. https://doi.org/10.1016/j.measurement.2018.09.032.
  • Xu, W., J. Xia, S. Min, and Y. Xiong. 2022. “Fourier Transform Infrared Spectroscopy and Chemometrics for the Discrimination of Animal Fur Types.” Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 274:121034. https://doi.org/10.1016/j.saa.2022.121034.
  • Zang, L., B. Xin, and N. Deng. 2022. “Identification of Wool and Cashmere Fibers Based on Multiscale Geometric Analysis.” The Journal of the Textile Institute 113 (6): 1001–1008. https://doi.org/10.1080/00405000.2021.1914399.
  • Zhang, X., X. Wu, H. Yang, H. Zheng, and Y. Zhou. 2023. “Identification of Cashmere and Wool by DNA Barcode.” Journal of Natural Fibers 20 (1): 2175100. https://doi.org/10.1080/15440478.2023.2175100.
  • Zhong, Y., K. Lu, J. Tian, and H. Zhu. 2017. “Wool/Cashmere Identification Based on Projection Curves.” Textile Research Journal 87:1730–1741. https://doi.org/10.1177/0040517516658516.
  • Zhu, Y., J. Duan, Y. Li, and T. Wu. 2022. “Image Classification Method of Cashmere and Wool Based on the Multi-Feature Selection and Random Forest Method.” Textile Research Journal 92 (7–8): 1012–1025. https://doi.org/10.1177/00405175211046060.
  • Zhu, Y., J. Huang, T. Wu, and X. Ren. 2021. “Identification Method of Cashmere and Wool Based on Texture Features of GLCM and Gabor.” Journal of Engineered Fibers and Fabrics 16:1558925021989179. https://doi.org/10.1177/1558925021989179.
  • Zhu, Y., J. Huang, T. Wu, and X. Ren. 2022. “An Identification Method of Cashmere and Wool by the Two Features Fusion.” International Journal of Clothing Science and Technology 34 (1): 13–20. https://doi.org/10.1108/IJCST-06-2020-0101.
  • Zhu, Y., H. JiaYI, Y. Li, and W. Li. 2022. “Image Identification of Cashmere and Wool Fibers Based on the Improved Xception Network.” Journal of King Saud University-Computer & Information Sciences 34:9301–9310. https://doi.org/10.1016/j.jksuci.2022.09.009.
  • Zhu, Y., L. Zhao, X. Chen, Y. Li, and J. Wang. 2022. “Animal Fiber Recognition Based on Feature Fusion of the Maximum Inter-Class Variance.” AUTEX Research Journal 23 (4): 560–566. https://doi.org/10.2478/aut-2022-0031.
  • Zhu, Y., L. Zhao, X. Chen, Y. Li, and J. Wang. 2023. “Identification of Cashmere and Wool Based on LBP and GLCM Texture Feature Selection.” Journal of Engineered Fibers and Fabrics 18:15589250221146548. https://doi.org/10.1177/15589250221146548.