Research Article

Few-shot segmentation based on multi-level and cross-scale clustering

Article: 2287972 | Received 22 Mar 2023, Accepted 21 Nov 2023, Published online: 29 Feb 2024

Abstract

The problem of image segmentation with few-shot learning is addressed in this paper, which is a challenging task due to the lack of sufficient high-precision annotated data. A novel method that consists of two modules is proposed: a multi-level fuzzy clustering guidance module and a cross-scale feature fusion module. The former module can extract image features in a class-independent feature space and fuse them with different scale information, while the latter module can reduce the information loss caused by cross-scale transmission. The feature association map between the support image and the query image can be learned by the proposed method, and the inconsistency of target object categories can be overcome. The proposed method is evaluated on Pascal and COCO datasets, and it is shown that it outperforms the state-of-the-art algorithms in both one-shot and k-shot segmentation scenarios.

Introduction

Using deep learning methods to solve image processing problems is one of the current research hotspots (Liu, Liu, Cao, et al., Citation2022; Zhang et al., Citation2021). However, deep learning techniques are highly data-dependent: when high-precision labelled training samples are insufficient, deep learning models cannot achieve convincing results (Eltouny et al., Citation2023). Labelling data is not only a heavy workload but also requires annotators with relevant expertise (Deng et al., Citation2009; Sung et al., Citation2018; Zhang et al., Citation2019).

To solve the above problems, some researchers have proposed a new paradigm, called few-shot learning, which learns general patterns from a limited number of examples and applies them to new problems. In few-shot learning, applying the learned prior knowledge to unknown domains is the key to the success of the algorithm (Dong & Xing, Citation2018; Lang et al., Citation2022, Citation2023; Liu, Liu, Yao, et al., Citation2022; Zhao et al., Citation2017).

The mainstream framework for few-shot learning is shown in Figure 1. First, the feature extraction network extracts support image features $F_s$ and query image features $F_q$ with shared weights. Then $F_s$ and $F_q$ are input into the association mapping learning module, in which the correlation map between them is established. In the final stage, new query image features are generated under the association mapping guidance and input to the target network for segmentation. In few-shot image segmentation, the key problem is to learn the relationship between support images and query images (Ding et al., Citation2023).

Figure 1. The mainstream framework of few-shot learning.

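To make the pipeline in Figure 1 concrete, the following is a minimal PyTorch sketch of the shared-backbone, association-mapping and target-network structure. All layer sizes and names are illustrative assumptions, not the authors' implementation; the association step is reduced to a simple masked-average prototype for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FewShotSegFramework(nn.Module):
    """Shared backbone -> association mapping -> target network (Figure 1)."""

    def __init__(self, feat_dim=64):
        super().__init__()
        # Shared-weight feature extractor (stand-in for a ResNet backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # Association mapping: fuse F_q with guidance derived from F_s.
        self.fuse = nn.Conv2d(feat_dim * 2, feat_dim, 1)
        # Target network: 1x1 convolution for per-pixel classification.
        self.head = nn.Conv2d(feat_dim, 2, 1)

    def forward(self, support_img, support_mask, query_img):
        f_s = self.backbone(support_img)                 # F_s: (B, C, H, W)
        f_q = self.backbone(query_img)                   # F_q: (B, C, H, W)
        mask = F.interpolate(support_mask, size=f_s.shape[-2:])
        # Masked average of F_s as a simple guidance signal.
        proto = (f_s * mask).sum((2, 3)) / mask.sum((2, 3)).clamp(min=1e-6)
        guidance = proto[:, :, None, None].expand_as(f_q)
        new_f_q = self.fuse(torch.cat([f_q, guidance], dim=1))
        return self.head(new_f_q)                        # per-pixel logits

model = FewShotSegFramework()
logits = model(torch.randn(1, 3, 64, 64), torch.ones(1, 1, 64, 64),
               torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 2, 64, 64])
```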

Currently, there are two ways to obtain guidance information in the association mapping guidance module: kinship learning and prototype feature learning (Deng et al., Citation2009; Zhang et al., Citation2019; Zhao et al., Citation2017). Kinship learning directly matches the pixel-level relationship between the support image and the query image, and uses this relationship as prior knowledge to improve segmentation performance (Lu et al., Citation2023; Sung et al., Citation2018). However, kinship learning has poor generalisation ability. Prototype learning compresses the mask information of the support image into a support prototype. Compared with kinship learning, prototype learning is more robust, but it easily ignores critical information in the image (Zhang et al., Citation2019). In addition, if the target object in the support image differs significantly from the target object in the query image, neither kinship learning nor prototype feature learning can establish an accurate association mapping (Chen et al., Citation2020; Gao, Fang, et al., Citation2022; Gao, Xiao, et al., Citation2022; Liu, Wang, et al., Citation2020). To address these issues, recent studies have proposed new few-shot semantic segmentation methods that utilise the information in the support set to improve the segmentation of target categories in the query image (Boudiaf et al., Citation2021; Lu et al., Citation2021).

In the process of establishing the association mapping guidance module, accurate prior knowledge must be extracted from the support image to guide query image segmentation (Zhao et al., Citation2017). However, when the prior knowledge in association mapping learning is too specific, the generalisation ability of the model is reduced.

In this paper, a multi-level fuzzy clustering guidance module is designed to build the association mapping guidance. Instead of using a single-scale image, the module retrieves multi-level information of the query image at four different sizes. The idea of clustering is introduced to obtain pixel blocks with similar attributes in the image, which are used as prototypes for subsequent operations. In addition, to fully utilise and integrate features of different scales, a cross-scale feature fusion module is proposed. This module mines the correlation between query image features and support image features at different levels, thereby reducing the loss of image information caused by cross-scale transmission.

Experiments on the Pascal (Vinyals et al., Citation2016) and COCO (Lin et al., Citation2014) datasets illustrate that the proposed network not only performs well in one-shot segmentation but can also be easily extended to k-shot segmentation. It is worth mentioning that the transition to k-shot does not incur any additional computational cost.

The main contributions of this paper can be summarised in three aspects.

  1. The idea of fuzzy clustering is introduced to extract the image features from a category-independent feature space, which provides prior knowledge for query image segmentation.

  2. A multi-level fuzzy clustering guidance module is proposed to grasp the support prototypes at four spatial scales. It can reduce the negative impact caused by the inconsistency between the target objects in the support image and the query image.

  3. A cross-scale feature fusion module is designed to fully use and integrate feature information of different scales and to reduce the loss caused by information flow.

The rest of this paper is organised as follows. In the following section, we introduce related fields and recent research works. Section “Proposed method” introduces the structure of the proposed model in detail. Section “Experiment” presents the experimental results on different datasets. Finally, the current work is summarised and future research directions are briefly described.

Related work

Image segmentation

Image segmentation is the basis of computer vision, image analysis and other related fields. In the last decade, significant progress has been made in image segmentation due to the emergence of the fully convolutional network (FCN) (Long et al., Citation2015) and its variants, including SegNet (Badrinarayanan et al., Citation2017), UNet (Ronneberger et al., Citation2015), RefineNet (Lin et al., Citation2017) and others. Recently, multi-scale feature aggregation and attention mechanisms have brought new technological breakthroughs for image segmentation (Yao et al., Citation2020; Yu et al., Citation2022). However, all these models face the same limitation of data dependency: the volume of the training set and the accuracy of pixel-level labelling directly affect model performance. Moreover, traditional models cannot identify new categories in the application phase that do not exist in the training set.

Few-shot learning for image segmentation

Different from the above methods, few-shot learning can apply the prior knowledge to predict new image categories. In image segmentation, the task of few-shot learning is to predict the label corresponding to each pixel, which is more challenging than image classification (Liu et al., Citation2021).

In Dong and Xing (Citation2018), the idea of prototype learning was first proposed to guide image segmentation by measuring the similarity between the prototype and the query pixel. Since then, many algorithms for obtaining the prototype have been investigated. For example, the similarity guidance network (SG-One) (Zhang et al., Citation2020) takes the masked average pooling of support features as the prototype. Many new network structures for few-shot image segmentation have also been proposed. For example, CRNet (Xing et al., Citation2020) proposes a cross-reference module to mine the common feature information between the support image and the query image. Pyramid graph networks (PGNet), BriNet (which bridges the intra-class and inter-class gaps in one-shot segmentation) and the democratic attention network (DAN) (Yang, Wang, et al., Citation2020; Zhang, Chen, et al., Citation2019; Zhang, Sun, et al., Citation2021) introduce dense pixel-to-pixel connections between the support image and the query image. The prototype mixture model (PMM) (Yang, Liu, et al., Citation2020) uses an expectation-maximisation algorithm to generate multiple support prototypes, addressing the problem that a single support prototype is insufficient to describe image features.

Fuzzy clustering

Traditional image segmentation methods depend little on data and offer strong interpretability (Koch et al., Citation2015; Wang et al., Citation2020). Even amid the craze for deep learning, traditional machine learning methods remain an active research topic (Hartigan & Wong, Citation1979).

Traditional pixel correlation measurement methods mainly include hard correlation and soft correlation. Hard correlation methods, such as K-means (Hartigan & Wong, Citation1979), cannot be integrated into modern deep networks due to their non-differentiable nature. The emergence of fuzzy sets extends traditional set theory and provides a new tool to handle soft correlation between sets (Papandreou et al., Citation2015). On the basis of fuzzy set theory, the fuzzy C-means (FCM) algorithm, with its simple principle and low complexity, was proposed by Dunn and later generalised by Bezdek (Bezdek et al., Citation1984).

Definition of problem

General terms are formally defined in this subsection. The support image set is represented as $S$ and the query image set as $Q$. The number of labels and inputs in deep embedding learning is limited, and this setting is also known as N-way K-shot classification (N-way: the number of classes in a support set; K-shot: K examples in each category). The support image set is formalised as $S=\{(x_{1,1},c_{1,1}),(x_{1,2},c_{1,2}),\ldots,(x_{N,K},c_{N,K})\}$ and the query image set as $Q=\{q_1,\ldots,q_{N\times M}\}$, in which $x_{i,j}$ represents example $j$ in category $i$ and $c_{i,j}\in\{1,2,\ldots,N\}$. The train set is formalised as $S_{train}=\{(I_{s/q},M_{s/q})\}$ and the test set as $S_{test}=\{(I_{s/q},M_{s/q})\}$, meeting the requirement $S_{train}\cap S_{test}=\emptyset$. $I\in\mathbb{R}^{H\times W\times 3}$ is the RGB image and $M\in\mathbb{R}^{H\times W}$ is the mask image.
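As a hedged illustration of these definitions, the sketch below samples an N-way K-shot episode from a class-indexed dataset; the dictionary layout and helper names are assumptions for exposition only.

```python
import random

def sample_episode(dataset, classes, n_way=1, k_shot=5, m_query=1):
    """dataset: dict mapping class id -> list of (image, mask) pairs."""
    episode_classes = random.sample(classes, n_way)
    support, query = [], []
    for c in episode_classes:
        pairs = random.sample(dataset[c], k_shot + m_query)
        support += [(img, msk, c) for img, msk in pairs[:k_shot]]   # goes into S
        query += [(img, msk, c) for img, msk in pairs[k_shot:]]     # goes into Q
    return support, query

# Train and test class sets must be disjoint: S_train ∩ S_test = ∅.
train_classes, test_classes = list(range(15)), list(range(15, 20))
assert not set(train_classes) & set(test_classes)

# Tiny fake dataset just to show the call shape.
fake = {c: [(f"img{c}_{i}", f"mask{c}_{i}") for i in range(10)]
        for c in train_classes}
support, query = sample_episode(fake, train_classes, n_way=1, k_shot=5)
print(len(support), len(query))  # 5 1
```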

Industrial significance

This paper proposes a few-shot segmentation method, a technique for segmenting images with limited annotated data. This technique has many important applications in various domains, such as medical image analysis, autonomous driving and machine vision. For instance, tumours, tissues and cells in medical images can be segmented using only a few samples, which can improve the accuracy and efficiency of diagnosis.

Proposed method

Different from existing few-shot methods, which compress the support image into a single support prototype and use a single-layer information fusion structure, the proposed model grasps multi-level information at four different sizes, breaking the limitation of a single scale. Besides, the proposed model introduces the idea of clustering to obtain pixel blocks with similar properties as support prototypes. The model can also effectively reduce the loss of image information caused by cross-scale transmission.

The overall architecture of the proposed network model is presented in Figure 2.

Figure 2. The overall architecture of the model.


Firstly, support image features and query image features are extracted by a feature extraction network with shared weights. The support image features (after pre-processing) and the query image features are input into the multi-level fuzzy clustering guidance module (MLFCG), in which prior knowledge at different levels is retrieved from the support image. Under the guidance of this prior knowledge, the MLFCG module generates four new query image features. Next, to minimise the loss of image feature information and increase the resolution of the output image, the four new query features are fed into the cross-scale feature fusion module (CSFF). In the final stage, the feature information output by CSFF is input into the target network, in which a 1×1 convolution performs pixel classification.

Multi-level fuzzy clustering guidance module

The idea of fuzzy clustering is introduced in the multi-level fuzzy clustering guidance module (MLFCG) to measure the soft correlation between pixels. The purpose is to obtain the cluster centres as prototypes.

As shown in Figure 3, the model uses four adaptive average pooling operations to obtain query features at four scales ($60\times60$, $30\times30$, $15\times15$ and $8\times8$). Meanwhile, spatial position information is added to the support feature, changing its size from $H\times W\times C$ to $H\times W\times(C+2)$. In addition, to improve the accuracy of clustering and reduce the amount of computation, the support mask is down-sampled and used to filter out the background information. The feature map is then compressed from $H\times W\times(C+2)$ to $(N_m,C+2)$, where $N_m$ is the number of pixels inside the mask of the support image. The $(N_m,C+2)$ feature vectors are input into the Cluster Guiding Prototype (CGP) module to calculate the support prototypes. After four parallel CGPs, four levels of support prototypes ($\tilde{F}_s^1$, $\tilde{F}_s^2$, $\tilde{F}_s^3$ and $\tilde{F}_s^4$) corresponding to the query features are obtained.

Figure 3. The multi-level fuzzy cluster guidance-extract prototype.

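The pre-processing just described can be sketched as follows in PyTorch; the tensor layouts and normalised coordinates are assumptions consistent with the text (features grow from $C$ to $C+2$ channels, and only the $N_m$ masked pixels are kept).

```python
import torch
import torch.nn.functional as F

def prepare_support(feat, mask):
    """feat: (C, H, W) support feature; mask: (H, W) binary support mask."""
    c, h, w = feat.shape
    # Append normalised (y, x) coordinates: C -> C + 2 channels.
    ys = torch.linspace(0, 1, h).view(h, 1).expand(h, w)
    xs = torch.linspace(0, 1, w).view(1, w).expand(h, w)
    feat = torch.cat([feat, ys[None], xs[None]], dim=0)       # (C+2, H, W)
    # Keep only the N_m foreground pixels selected by the support mask.
    fg = mask.bool().flatten()
    return feat.flatten(1).t()[fg]                            # (N_m, C+2)

def multi_scale_query(feat_q, sizes=(60, 30, 15, 8)):
    # Adaptive average pooling yields the four query scales used by MLFCG.
    return [F.adaptive_avg_pool2d(feat_q, s) for s in sizes]

pixels = prepare_support(torch.randn(256, 60, 60),
                         (torch.rand(60, 60) > 0.5).float())
print(pixels.shape)  # (N_m, 258)
```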

Next, the four support prototypes obtained from fuzzy clustering are adopted as prior knowledge to guide the query features. As shown in Figure 4, the query features and support prototypes are input into the Prototype Guiding Allocation (PGA) module. Under the guidance of support prototypes at different levels, four new query features of different scales are generated. CGP and PGA are explained in detail below.

Figure 4. The multi-level fuzzy clustering guidance-use prototype.


Cluster guiding prototype (CGP)

This paper introduces the idea of clustering to aggregate image features with similar properties into a support prototype, which can provide more useful prior knowledge for subsequent operations. The specific implementation is the Cluster Guiding Prototype (CGP) module. First, the spatial locations of the support prototypes are randomly initialised. Let the number of support prototypes be $C$; the corresponding objective function is:
$$F=\sum_{i=1}^{C}\sum_{j=1}^{n}u_{ij}^{m}\,d^{2}(x_j,v_i)\tag{1}$$
The objective function in equation (1) is minimised under the constraints $\sum_{i=1}^{C}u_{ij}=1$ for all pixels, where $n$ is the number of pixels and $u_{ij}$ is the fuzzy membership degree of pixel $j$ with respect to support prototype $i$. $d$ is the distance metric, defined as:
$$d=\sqrt{d_f^{2}+(d_s/r)^{2}}\tag{2}$$
where $d_f$ is the Euclidean distance in image feature space, $d_s$ is the Euclidean distance in image spatial location and $r$ is a balance factor.

The unconstrained objective function constructed by the Lagrange multiplier method is shown in equation (3):
$$J=\sum_{i=1}^{C}\sum_{j=1}^{n}u_{ij}^{m}\,d^{2}(x_j,v_i)+\sum_{j=1}^{n}\lambda_j\left(\sum_{i=1}^{C}u_{ij}-1\right)\tag{3}$$
where $\lambda=[\lambda_1,\lambda_2,\ldots,\lambda_n]^{T}$ is the vector of Lagrange multipliers. Taking the derivative of $J$ with respect to $u_{ij}$ and setting $\frac{\partial J}{\partial u_{ij}}=0$, we obtain the value of $u_{ij}$ that minimises equation (3):
$$\frac{\partial J}{\partial u_{ij}}=m\,u_{ij}^{m-1}d^{2}(x_j,v_i)+\lambda_j=0\tag{4}$$
From equation (4), we obtain $u_{ij}=\left(\frac{-\lambda_j}{m\,d^{2}(x_j,v_i)}\right)^{\frac{1}{m-1}}$. Because $\sum_{i=1}^{C}u_{ij}=\sum_{i=1}^{C}\left(\frac{-\lambda_j}{m\,d^{2}(x_j,v_i)}\right)^{\frac{1}{m-1}}=1$, we can infer $\left(\frac{-\lambda_j}{m}\right)^{\frac{1}{m-1}}=1\Big/\sum_{i=1}^{C}\frac{1}{d(x_j,v_i)^{2/(m-1)}}$. Hence, the fuzzy membership degree of pixel $j$ with respect to prototype $i$ can be calculated as equation (5):
$$u_{ij}=\frac{1}{\sum_{k=1}^{C}\left(\frac{d(x_j,v_i)}{d(x_j,v_k)}\right)^{\frac{2}{m-1}}}\tag{5}$$
Taking the derivative of $J$ with respect to $v_i$ as equation (6) and setting $\frac{\partial J}{\partial v_i}=0$, we obtain the value of $v_i$ that minimises equation (3):
$$\frac{\partial J}{\partial v_i}=-2\sum_{j=1}^{n}u_{ij}^{m}(x_j-v_i)=0\tag{6}$$
The update equation of support prototype $v_i$ follows:
$$v_i=\frac{\sum_{j=1}^{n}u_{ij}^{m}x_j}{\sum_{j=1}^{n}u_{ij}^{m}}\tag{7}$$
The whole CGP process is delineated in Algorithm 1.
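A minimal sketch of the CGP iteration (Algorithm 1) under equations (2), (5) and (7) is given below. The fuzzifier $m$, balance factor $r$ and iteration count are illustrative values, not taken from the paper.

```python
import torch

def cgp(pixels, num_protos=5, m=2.0, r=1.0, iters=10, eps=1e-8):
    """pixels: (N, C+2) rows = [feature (C dims) | spatial (2 dims)]."""
    n, _ = pixels.shape
    v = pixels[torch.randperm(n)[:num_protos]].clone()  # random initial prototypes
    feat, pos = pixels[:, :-2], pixels[:, -2:]
    u = None
    for _ in range(iters):
        # Squared distance d^2 = d_f^2 + (d_s / r)^2 of equation (2).
        d_f2 = torch.cdist(feat, v[:, :-2]) ** 2        # (N, K)
        d_s2 = torch.cdist(pos, v[:, -2:]) ** 2
        dist2 = d_f2 + d_s2 / r ** 2 + eps
        # Membership update, equation (5): u ∝ d^{-2/(m-1)}, rows sum to 1.
        u = dist2 ** (-1.0 / (m - 1))
        u = u / u.sum(dim=1, keepdim=True)
        # Prototype update, equation (7): mean weighted by u^m.
        w = u ** m                                      # (N, K)
        v = (w.t() @ pixels) / w.sum(dim=0)[:, None]
    return v, u

protos, memberships = cgp(torch.randn(500, 258))        # e.g. C = 256 + 2 coords
print(protos.shape)  # torch.Size([5, 258])
```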

Prototype guiding allocation (PGA)

PGA adopts the support prototypes to guide the query image to generate new query image features, as presented in Figure 5.

Figure 5. The structure of prototype guiding allocation.


In PGA, the correlation is measured by the cosine similarity between the support prototype and the query image feature:
$$C_i^{x,y}=\frac{v_i\cdot F_q^{x,y}}{\|v_i\|\,\|F_q^{x,y}\|},\quad i\in\{1,2,\ldots,C\}\tag{8}$$
where $F_q^{x,y}\in\mathbb{R}^{c\times 1}$ is the feature of the query image at position $(x,y)$. The index of the most associated support prototype at each location is determined by equation (9):
$$G^{x,y}=\arg\max_{i\in\{1,\ldots,C\}}C_i^{x,y}\tag{9}$$
Next, the guide map $G\in\mathbb{R}^{h\times w}$ is recombined with the support prototypes indexed by its values to form the location index table of the query image, denoted as $F_G\in\mathbb{R}^{c\times h\times w}$. In the last layer of the clustering guidance module, the probability map $P$ is obtained by summing the similarities. At the end of the module, the final query image feature $\tilde{F}_q$ is obtained by concatenating the probability map, the cluster index table and the original image features:
$$\tilde{F}_q=f(F_q\oplus F_G\oplus P)\tag{10}$$
where $\oplus$ denotes channel-wise concatenation.
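The following sketch traces PGA through equations (8)-(10); the final fusion $f(\cdot)$ is omitted, and the shapes are assumptions consistent with the notation above.

```python
import torch
import torch.nn.functional as F

def pga(f_q, protos):
    """f_q: (C, H, W) query feature; protos: (K, C) support prototypes."""
    c, h, w = f_q.shape
    q = f_q.flatten(1).t()                                        # (H*W, C)
    # Equation (8): cosine similarity of every query pixel to every prototype.
    sim = F.cosine_similarity(q[:, None], protos[None], dim=-1)   # (H*W, K)
    guide = sim.argmax(dim=1)                                     # equation (9)
    f_g = protos[guide].t().view(c, h, w)     # location index table F_G
    p = sim.sum(dim=1).view(1, h, w)          # probability map P (summed similarity)
    # Equation (10), before the final fusion f(.): concatenate F_q, F_G and P.
    return torch.cat([f_q, f_g, p], dim=0)

new_f_q = pga(torch.randn(256, 30, 30), torch.randn(5, 256))
print(new_f_q.shape)  # torch.Size([513, 30, 30])
```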

Cross-scale feature fusion

To obtain more reliable support prototypes, the MLFCG module is proposed in this paper. In MLFCG, four levels of prior knowledge are used to guide query features at four size levels, finally generating four new query features. To fuse these four new query features, this paper proposes the cross-scale feature fusion module (CSFF).

CSFF can make full use of and integrate feature information of different scales to reduce the loss caused by information flow. The structure of CSFF is shown in Figure 6.

Figure 6. The structure of cross-scale feature fusion.


As shown in Figure 6, four new query features of different sizes are input into CSFF. Firstly, the three new query features with larger sizes are down-sampled, and the down-sampled features are merged into the smaller-scale feature layers by channel concatenation. CSFF maintains a higher level of image feature detail by fusing image features of different sizes. This fusion method integrates the features of different scales longitudinally and extracts the depth information of each scale horizontally. Next, 8×8 convolution operations are used at each layer for information fusion.

Image features of all layers are concatenated at the end of the module to obtain the fused query image features. In addition, to further complement the image details, the original query image features are concatenated with the fused query image features.
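A hedged sketch of CSFF follows: larger-scale query features are down-sampled and concatenated channel-wise into each smaller-scale layer, fused per layer, and finally concatenated across layers. The channel sizes and per-layer fusion kernels are illustrative (a 3×3 kernel stands in for the 8×8 convolution mentioned above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSFF(nn.Module):
    def __init__(self, dim=256, num_scales=4):
        super().__init__()
        # Fusion conv per layer; layer i also receives i down-sampled inputs.
        self.fuse = nn.ModuleList(
            nn.Conv2d(dim * (i + 1), dim, 3, padding=1) for i in range(num_scales)
        )

    def forward(self, feats):
        """feats: list of (B, dim, s, s) query features, largest scale first."""
        outs = []
        for i, f in enumerate(feats):
            # Down-sample every larger-scale feature to this layer's size,
            # then fuse by channel concatenation + convolution.
            extra = [F.adaptive_avg_pool2d(feats[j], f.shape[-1]) for j in range(i)]
            outs.append(self.fuse[i](torch.cat(extra + [f], dim=1)))
        # Up-sample all layers to the largest scale and concatenate at the end.
        top = feats[0].shape[-1]
        return torch.cat([F.interpolate(o, size=top, mode='bilinear',
                                        align_corners=False) for o in outs], dim=1)

csff = CSFF()
feats = [torch.randn(1, 256, s, s) for s in (60, 30, 15, 8)]
print(csff(feats).shape)  # torch.Size([1, 1024, 60, 60])
```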

Loss function

The loss function of the proposed model includes the cross entropy between the result and the reference image, and four cross-entropy losses at different scales, formalised as follows:
$$L=L_1+\frac{\sigma}{n}\sum_{i=1}^{n}L_2^{i},\quad i\in\{1,2,\ldots,4\}\tag{11}$$
where $L_1$ is the cross entropy between the segmentation result and the reference image, and $L_2^{i}$ is the cross entropy between the query prediction at scale $i$ and the reference ground truth, computed in the cross-scale feature fusion module. $\sigma$ is a balance parameter, set to 1.0 in the proposed model.
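Assuming the four auxiliary predictions come from the cross-scale layers and $\sigma = 1.0$ as stated, equation (11) could be computed as in the sketch below (not the authors' code):

```python
import torch
import torch.nn.functional as F

def total_loss(main_logits, aux_logits_list, target, sigma=1.0):
    """main_logits: (B, 2, H, W); aux_logits_list: four coarser logits;
    target: (B, H, W) integer label map."""
    l1 = F.cross_entropy(main_logits, target)                  # L1
    l2 = 0.0
    for aux in aux_logits_list:                                # the four L2^i terms
        aux = F.interpolate(aux, size=target.shape[-2:],
                            mode='bilinear', align_corners=False)
        l2 = l2 + F.cross_entropy(aux, target)
    return l1 + sigma * l2 / len(aux_logits_list)

target = torch.randint(0, 2, (1, 64, 64))
main = torch.randn(1, 2, 64, 64)
aux = [torch.randn(1, 2, s, s) for s in (60, 30, 15, 8)]
print(total_loss(main, aux, target).item())
```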

Experiment

Datasets and evaluation metric

To verify the performance of the model, two datasets are used: Pascal and COCO. Pascal consists of the Pascal VOC 2012 dataset (Tian et al., Citation2022) extended with SBD (Nguyen & Todorovic, Citation2019), while COCO is known for its higher segmentation difficulty.

In the experiments, the 20 classes in the Pascal dataset were divided into four parts, each containing five categories. The model is trained by cross-validation: three parts are selected to train the model, and the remaining part is used to verify its effect. In the test phase, 1000 support-query pairs were randomly sampled. Similarly, the 80 classes in the COCO dataset are divided into four parts, each with 20 classes. The COCO validation set contains 40,137 images (80 classes). Because the number of images in COCO is much larger than in Pascal, the 1000 query-support pairs used in previous work are insufficient; therefore, during evaluation, 20,000 support-query pairs were randomly selected to ensure reliable results.

In image segmentation, IoU (intersection-over-union) is often used as an evaluation index. In general, the higher the IoU, the higher the accuracy. When the number of samples per class is unbalanced (for example, 49 sheep samples and 378 human samples), IoU alone cannot describe the performance accurately. To avoid the influence of class imbalance, the mean intersection over union (mIoU) was adopted, defined as follows:
$$IoU=\frac{TP}{TP+FP+FN}\tag{12}$$
$$mIoU=\frac{1}{n_l}\sum_{l}IoU_l\tag{13}$$
where $TP$, $FP$ and $FN$ are the numbers of true positive, false positive and false negative pixels, respectively, and $n_l$ is the number of object categories.
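Equations (12) and (13) can be computed directly from the confusion counts; the sketch below uses a binary foreground/background setting, and the per-class values at the end are illustrative, not measured results.

```python
def iou(pred, gt):
    """Binary IoU from equation (12); pred/gt are 0-1 integer arrays or tensors."""
    tp = ((pred == 1) & (gt == 1)).sum()
    fp = ((pred == 1) & (gt == 0)).sum()
    fn = ((pred == 0) & (gt == 1)).sum()
    return float(tp) / float(tp + fp + fn + 1e-10)

def miou(per_class_iou):
    # Equation (13): average per-class IoU so rare classes count equally.
    return sum(per_class_iou.values()) / len(per_class_iou)

print(miou({'sheep': 0.61, 'person': 0.55}))  # 0.58 (illustrative values)
```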

In addition, when the target object occupies only a small proportion of the overall image, the IoU value cannot evaluate performance accurately. Therefore, the foreground-background IoU (FB-IoU) is also considered when evaluating performance.

Implementation details

In the experiment, ResNet is adopted as the backbone network with the SGD optimiser. When the Pascal dataset is trained, the initial learning rate is set to 0.0025 and the batch size to 4. For the COCO dataset, the initial learning rate is set to 0.005 and the batch size to 8. The number of support prototypes is initialised to 5.

Before training, data augmentation operations such as horizontal flipping and rotation are performed, and all images are resized to 473×473 (Pascal) or 641×641 (COCO). The runtime environment is PyTorch, and the model runs on an NVIDIA GPU.
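For reference, a sketch of the stated training configuration in PyTorch; only the learning rates, batch sizes, crop sizes and prototype count come from the text, while the momentum and weight decay are assumed placeholders.

```python
import torch

model = torch.nn.Conv2d(3, 2, 1)        # placeholder for the full network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.0025,                          # Pascal (0.005 for COCO), from the text
    momentum=0.9,                       # assumed, not stated in the text
    weight_decay=1e-4,                  # assumed, not stated in the text
)
batch_size, crop_size = 4, 473          # Pascal settings; 8 and 641 for COCO
num_prototypes = 5                      # initial number of support prototypes
```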

Ablation study

In the field of few-shot image segmentation, the information obtained after average pooling is often used as a support prototype. In order to verify that the support prototypes obtained by the CGP module can guide query image segmentation more effectively, a model B1 is designed as a comparative experiment. To make the comparison convincing, B1 is identical to the proposed model except that the support prototype is obtained by average pooling instead of CGP.

Ablation studies were performed on the Pascal datasets, as tabulated in Table 1, in which the column labelled Δ denotes the difference between 1-shot and 5-shot. Table 1 shows that the guidance prototype obtained under fuzzy clustering grasps the image features better and yields higher mIoU values. The comparison indicates that when the mask information of the support image is condensed into a support prototype through average pooling, feature information in the support image is easily lost and effective prior knowledge cannot be extracted.

Table 1. mIoU comparison of different models in the ablation study.

In order to confirm that support prototypes at four spatial scales can reduce the negative impact caused by inconsistent target objects in the support-query pair, a B2 model is designed, in which CGP and PGA are simplified into a single structure. In B2, the support prototype is learned only from the original support image, and the query features fused with guidance information are obtained by a single-layer PGA and directly input into the target network. As shown in Table 1, the mIoU of the proposed model is almost always the best, which means the proposed model achieves the best performance whether the backbone network is ResNet50 or ResNet101.

In summary, compared with a single-scale model structure, the cross-scale feature fusion module (CSFF) significantly improves image segmentation performance.

Comparison to state-of-the-art models

Pascal

The proposed model is compared with related methods on the Pascal dataset, and the results are presented in Table 2. According to the mIoU values, the model outperforms or matches the existing methods in both the 1-shot and 5-shot cases with the same backbone network. It is worth mentioning that in the 5-shot case on the Pascal dataset, the proposed model outperforms the state-of-the-art methods by an average of 3 percentage points. To increase the persuasiveness of the experimental results, both mIoU and FB-IoU are adopted as indicators to evaluate the proposed model and the state-of-the-art algorithms. According to the data in Table 3, the proposed model achieves almost the highest level of image segmentation results regardless of which evaluation index is used.

Table 2. mIoU comparison of the proposed model with state-of-the-art methods on Pascal.

Table 3. Comparison of FB-IoU on Pascal.

To show the performance more vividly, five sample images of different categories (including cows, cars, cats, sofas and birds) are presented in Figure 7. In Figure 7, the first two rows are the support images and the corresponding ground truths, the next two rows are the query images and the corresponding ground truths, and the last row presents the results of the proposed model. From Figure 7, we can see that the proposed model achieves satisfying results even when there is a large difference between the support image and the query image. The results show that the proposed model is reasonable and reliable.

Figure 7. Image segmentation results on Pascal.


COCO

Compared with Pascal, the COCO dataset is larger and more complex, hence almost all algorithms perform poorly on it. Table 4 shows that the proposed model still has a significant advantage over the state-of-the-art algorithms: it outperforms them in both 1-shot and 5-shot settings while using fewer parameters. The column labelled Δ denotes the difference between 1-shot and 5-shot; the performance on 5-shot is better than on 1-shot. The FB-IoU (Yang, Wang, et al., Citation2020) values of the corresponding models on the COCO dataset are presented in Table 5. Compared with other algorithms, the proposed model improves FB-IoU by at least one percentage point and up to five percentage points. In addition, more accurate results can be obtained with ResNet101, and the 5-shot setting generally provides higher accuracy than the 1-shot setting.

Table 4. Comparison with state-of-the-art methods on COCO with per-split results.

Table 5. Comparison of FB-IoU on COCO.

In Figure 8, support-query image pairs (including cows, giraffes, humans, airplanes and surfers) are selected to show the advantages of the proposed model. The images in Figure 8 are arranged in the same order as in Figure 7. As Figure 8 shows, the results of the proposed algorithm are satisfactory.

Figure 8. Image segmentation results on dataset COCO.


Discussion

Our method is designed for few-shot segmentation, a technique to segment images when high-precision annotated data are scarce. However, our method also faces some challenges and limitations. On the one hand, it relies on a fuzzy clustering algorithm to extract image features, and this algorithm requires pre-setting the number of clusters and the membership degree threshold, which may affect the clustering outcomes and the segmentation quality. On the other hand, the method fuses image features of different scales with prior knowledge, which increases computation and memory consumption and may reduce segmentation speed and efficiency.

Despite these limitations, our method still has some unique advantages and innovations. We believe that our method has great potential for improvement and application in the future.

Conclusions

Aiming at the problems that existing support prototypes cannot fit image features accurately and that the single-level flow of image features loses information, this paper proposed a multi-level clustering module and a cross-scale feature fusion module. Specifically, the multi-level clustering module is based on the idea of fuzzy clustering, and feature vectors with similar properties in support images are used as support prototypes. The cross-scale feature fusion module reduces the information loss caused by information flow between single layers and improves the accuracy of image segmentation. Extensive comparison and ablation experiments on both Pascal and COCO demonstrate the superiority of the proposed model, which achieves better performance than the current state-of-the-art algorithms.

Acknowledgements

The authors gratefully acknowledge the reviewers' helpful comments and suggestions, which improved the presentation significantly.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This research was funded by the National Natural Science Foundation of China [grant numbers 62007017, 61873117, U22A2033, 62171209, 62176140] and the Basic Research Project of the Yantai Science and Technology Innovation Development Plan (2023JCYJ044).

References

  • Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481–2495. https://doi.org/10.1109/TPAMI.2016.2644615
  • Bezdek, J. C., Ehrlich, R., & Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2–3), 191–203. https://doi.org/10.1016/0098-3004(84)90020-7
  • Boudiaf, M., Kervadec, H., Masud Z. I., Piantanida, P., Ayed, I.B., & Dolz, J. (2021). Few-shot segmentation without meta-learning: A good transductive inference is all you need?. In Michael S. Brown (Ed.), Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13979–13988). IEEE.
  • Chen, J., Ying, H., Liu, X., Gu, J., Feng, R., Chen, T., Gao, H., & Wu, J. (2020). A transfer learning based super-resolution microscopy for biopsy slice images: The joint methods perspective. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 18(1), 103–113. https://doi.org/10.1109/TCBB.2020.2991173
  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Daniel Huttenlocher (Ed.), IEEE conference on computer vision and pattern recognition (pp. 248–255). IEEE.
  • Ding, H., Zhang, H., & Jiang, X. (2023). Self-regularized prototypical network for few-shot semantic segmentation. Pattern Recognition, 133, 109018. https://doi.org/10.1016/j.patcog.2022.109018
  • Dong, N., & Xing, E. P. (2018). Few-shot semantic segmentation with prototype learning. BMVC, 3(4).
  • Eltouny, K., Gomaa, M., & Liang, X. (2023). Unsupervised learning methods for data-driven vibration-based structural health monitoring: A review. Sensors, 23(6), 3290. https://doi.org/10.3390/s23063290
  • Gairola, S., Hemani, M., Chopra, A., & Krishnamurthy, B. (2020). Simpropnet: Improved similarity propagation for few-shot image segmentation. arXiv preprint arXiv:2004.15014.
  • Gao, H., Fang, D., Xiao, J., Hussain, W., & Kim, J. Y. (2022). CAMRL: A joint method of channel attention and multidimensional regression loss for 3D object detection in automated vehicles. IEEE Transactions on Intelligent Transportation Systems, 24(8), 8831–8845. https://doi.org/10.1109/TITS.2022.3219474
  • Gao, H., Xiao, J., Yin, Y., Liu, T., & Shi, J. (2022). A mutually supervised graph attention network for few-shot segmentation: The perspective of fully utilizing limited samples. In Song Yongduan (Ed.), IEEE transactions on neural networks and learning systems. IEEE.
  • Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108.
  • Koch, G., Zemel, R., & Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. ICML Deep Learning Workshop, 2(1).
  • Lang, C., Cheng, G., Tu, B., & Han, J. (2022). Learning what not to segment: A new perspective on few-shot segmentation. In Rama Chellappa (Ed.), Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8057–8067). IEEE.
  • Lang, C., Cheng, G., Tu, B., Li, C., & Han, J. (2023). Base and meta: A new perspective on few-shot segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9), 10669–10686. https://doi.org/10.1109/TPAMI.2023.3265865
  • Li, G., Jampani, V., Sevilla-Lara, L., Sun, D., Kim, J., & Kim, J. (2021). Adaptive prototype learning and allocation for few-shot segmentation. In Michael S. Brown (Ed.), Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8334–8343). IEEE.
  • Lin, G., Milan, A., Shen, C., & Reid, I. (2017). Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Rama Chellappa (Ed.), Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1925–1934). IEEE.
  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In David Fleet (Ed.), Proceeding of ECCV 2014: 13th european conference (pp. 740–755). Springer Cham.
  • Liu, H., Wang, H., Wu, Y., & Xing, L. (2020). Superpixel region merging based on deep network for medical image segmentation. ACM Transactions on Intelligent Systems and Technology, 11(4), 1–22. https://doi.org/10.1145/3386090
  • Liu, R., Ma, L., Zhang, J., Fan, X., & Luo, Z. (2021). Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In Michael S. Brown (Ed.), Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10561–10570). IEEE.
  • Liu, Y., Liu, N., Cao, Q., Yao, X., Han, J., & Shao, L. (2022). Learning non-target knowledge for few-shot semantic segmentation. In Rama Chellappa (Ed.), Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11573–11582). IEEE.
  • Liu, Y., Liu, N., Yao, X., & Han, J. (2022). Intermediate prototype mining transformer for few-shot semantic segmentation. arXiv preprint arXiv:2210.06780.
  • Liu, Y., Zhang, X., Zhang, S., & He, X. (2020). Part-aware prototype network for few-shot semantic segmentation. In Andrea Vedaldi (Ed.), Proceeding of computer vision–ECCV (pp. 142–158). Springer.
  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Cordelia Schmid (Ed.), Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440). IEEE.
  • Lu, S., Zhao, H., Liu, H., Li, H., & Wang, N. (2023). PKRT-Net: Prior knowledge-based relation transformer network for optic cup and disc segmentation. Neurocomputing, 538, 126183. https://doi.org/10.1016/j.neucom.2023.03.044
  • Lu, Z., He, S., Zhu, X., Zhang, L., Song, Y.-Z., & Xiang, T. (2021). Simpler is better: Few-shot semantic segmentation with classifier weight transformer. In Tamara Berg (Ed.), Proceedings of the IEEE/CVF international conference on computer vision (pp. 8741–8750). IEEE.
  • Nguyen, K., & Todorovic, S. (2019). Feature weighting and boosting for few-shot segmentation. In Kyoung Mu Lee (Ed.), Proceedings of the IEEE/CVF international conference on computer vision (pp. 622–631). IEEE.
  • Papandreou, G., Chen, L.-C., Murphy, K. P., & Yuille, A. L. (2015). Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Ruzena Bajcsy (Ed.), Proceedings of the IEEE international conference on computer vision (pp. 1742–1750). IEEE.
  • Rakelly, K., Shelhamer, E., Darrell, T., Efros, A., & Levine, S. (2018). Conditional networks for few-shot semantic segmentation.
  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab (Ed.), Proceeding of medical image computing and computer-assisted intervention–MICCAI (pp. 234–241). Springer Cham.
  • Shaban, A., Bansal, S., Liu, Z., Essa, I., & Boots, B. (2017). One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410.
  • Siam, M., Oreshkin, B., & Jagersand, M. (2019). Adaptive masked proxies for few-shot segmentation. arXiv preprint arXiv:1902.11123.
  • Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., & Hospedales, T. M. (2018). Learning to compare: Relation network for few-shot learning. In Michael Brown (Ed.), Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1199–1208). IEEE.
  • Tian, Z., Zhao, H., Shu, M., Yang, Z., Li, R., & Jia, J. (2022). Prior guided feature enrichment network for few-shot segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(2), 1050–1065. https://doi.org/10.1109/TPAMI.2020.3013717
  • Vinyals, O., Blundell, C., Lillicrap, T., & Wierstra, D. (2016). Matching networks for one shot learning. In Daniel D. Lee (Ed.), Advances in neural information processing systems, Vol. 29. MIT Press.
  • Wang, H., Zhang, X., Hu, Y., Yang, Y., Cao, X., & Zhen, X. (2020). Few-shot semantic segmentation with democratic attention networks. In Vittorio Ferrari (Ed.), Proceeding of computer vision–ECCV (pp. 730–746). Springer.
  • Wang, K., Liew, J. H., Zou, Y., Zhou, D., & Feng, J. (2019). Panet: Few-shot image semantic segmentation with prototype alignment. In Kyoung Mu Lee (Ed.), Proceedings of the IEEE/CVF international conference on computer vision (pp. 9197–9206). IEEE.
  • Xing, Y., Wang, J., & Zeng, G. (2020). Malleable 2.5 D convolution: Learning receptive fields along the depth-axis for RGB-D scene parsing. In Vittorio Ferrari (Ed.), Proceeding of computer vision–ECCV (pp. 555–571). Springer.
  • Yang, B., Liu, C., Li, B., Jiao, J., & Ye, Q. (2020). Prototype mixture models for few-shot semantic segmentation. In Vittorio Ferrari (Ed.), Proceeding of computer vision–ECCV (pp. 763–778). Springer.
  • Yang, X., Wang, B., Chen, K., Zhou, X., Yi, S., Ouyang, W., & Zhou, L. (2020). Brinet: Towards bridging the intra-class and inter-class gaps in one-shot segmentation. arXiv preprint arXiv:2008.06226.
  • Yao, T., Kong, X., Fu, H., & Tian, Q. (2020). Discrete semantic alignment hashing for cross-media retrieval. IEEE Transactions on Cybernetics, 50(12), 4896–4907. https://doi.org/10.1109/TCYB.2019.2912644
  • Yu, X., Liu, H., Lin, Y., Wu, Y., & Zhang, C. (2022). Auto-weighted sample-level fusion with anchors for incomplete multi-view clustering. Pattern Recognition, 130, 108772. https://doi.org/10.1016/j.patcog.2022.108772
  • Zhang, C., Lin, G., Liu, F., Guo, J., Wu, Q., & Yao, R. (2019). Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In Kyoung Mu Lee (Ed.), Proceedings of the IEEE/CVF international conference on computer vision (pp. 9587–9595). IEEE.
  • Zhang, C., Lin, G., Liu, F., Yao, R., & Shen, C. (2019). Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Larry Davis (Ed.), Proceedings of the IEEE/CVF conference on computer version and pattern recognition (pp. 5217–5226). IEEE.
  • Zhang, F., Chen, Y., Li, Z., Hong, Z., Liu, J., Ma, F., & Ding, E. (2019). Acfnet: Attentional class feature network for semantic segmentation. In Kyoung Mu Lee (Ed.), Proceedings of the IEEE/CVF international conference on computer vision (pp. 6798–6807). IEEE.
  • Zhang, G., Kang, G., Yang, Y., & Wei, Y. (2021). Few-shot segmentation via cycle-consistent transformer. Advances in Neural Information Processing Systems, 34, 21984–21996.
  • Zhang, X., Sun, Y., Liu, H., Hou, Z., Zhao, F., & Zhang, C. (2021). Improved clustering algorithms for image segmentation based on non-local information and back projection. Information Sciences, 550, 129–144. https://doi.org/10.1016/j.ins.2020.10.039
  • Zhang, X., Wang, H., Zhang, Y., Gao, X., Wang, G., & Zhang, C. (2021). Improved fuzzy clustering for image segmentation based on a low-rank prior. Computational Visual Media, 7(4), 513–528. https://doi.org/10.1007/s41095-021-0239-3
  • Zhang, X., Wei, Y., Yang, Y., & Huang, T. S. (2020). SG-One: Similarity guidance network for one-shot semantic segmentation. IEEE Transactions on Cybernetics, 50(9), 3855–3865. https://doi.org/10.1109/TCYB.2020.2992433
  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In Rama Chellappa (Ed.), Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2881–2890). IEEE.