Full article: An unsupervised semantic segmentation method that combines the ImSE-Net model with SLICm superpixel optimization

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

ABSTRACT

In the field of remote sensing, using a large amount of labeled image data to supervise the training of fully convolutional networks for the semantic segmentation of images is expensive. However, using a small amount of labeled data can lead to reduced network performance. This paper proposes an unsupervised semantic segmentation method that combines the ImSE-Net model with SLICm superpixel optimization. First, the ImSE-Net model is used to extract semantic features from the image to obtain rough semantic segmentation results. Then, the SLICm superpixel segmentation algorithm is used to segment the input image into superpixel images. Finally, an unsupervised semantic segmentation model (UGLS) is used to combine high-level abstract semantic features with detailed information on superpixels to obtain edge-optimized semantic segmentation results. Experimental results show that compared with other semantic segmentation algorithms, our method more effectively handles unbalanced areas, such as object boundaries, and achieves better segmentation results, with higher semantic consistency.

KEYWORDS:

1. Introduction

With the advancement of Earth observation satellite technology, high-spatial-resolution remote sensing images (HSRSIs) have found applications in various fields, including urban planning (Ahmadi et al. Citation2010), city environmental modeling (Liu et al. Citation2019a), ecological assessment (Su et al. Citation2023), land use change monitoring (Li, Huang, and Gong Citation2019), and digital city evolution (Huang, Cao, and Li Citation2020). HSRSIs are characterized by high spatial resolution, clarity, ready availability, and richness of information. They offer detailed insights into ground objects and the relationships between adjacent elements. Segmenting HSRSIs is a crucial step in image interpretation, as the quality of segmentation directly impacts the accuracy of subsequent analysis and processing. Therefore, research on HSRSI segmentation holds both theoretical and practical significance (Baatz and Schape Citation2000; Kotaridis and Lazaridou Citation2021; Li et al. Citation2021; Shao et al. Citation2022; Wang Citation2021a; Wang et al. Citation2022).

HSRSI semantic segmentation aims to assign category labels to individual pixels in an image, achieving pixel-level classification. This process is one of the fundamental and crucial steps in extracting and analyzing image information (Xu et al. Citation2021; Yuan, Shi, and Gu Citation2021). Gradually evolving into a pivotal research area within computer vision, HSRSI semantic segmentation has found extensive applications in autonomous driving, scene comprehension, and robot navigation. However, attaining high-accuracy semantic segmentation of high-resolution remote sensing images poses challenges, primarily stemming from the difficulty in extracting effective features owing to the presence of significant intra-class differences and subtle inter-class variations in these images. Additionally, conventional HSRSI semantic segmentation techniques heavily rely on classifiers integrated with manually extracted features, encompassing supervised, semi-supervised and unsupervised semantic segmentation methods.

The supervised and semi-supervised semantic segmentation algorithm necessitates an extensive training dataset that aligns consistently with the spectral texture characteristics of the targeted remote sensing image for classification. Furthermore, manual generation of high-quality training datasets presents challenges of low annotation efficiency and high costs. Hence, in scenarios in which suitable training data is lacking, opting for unsupervised depth classification becomes a reasonable choice. Compared with the supervised method, the unsupervised remote sensing image classification model does not require high-quality training data with real labels, consumes fewer computing resources and less time, and holds higher practical application value. In recent years, unsupervised semantic segmentation technology based on HSRSI deep learning has increasingly become a central research focus for scholars. This approach allows for the automatic extraction of information concerning low-level, intermediate, and high-level semantics of images through a network (Jiang et al. Citation2016). Subsequently, it incorporates classifier assistance for pixel classification using networks such as the fully convolutional network (FCN), U-Net, SegNet, and the DeepLab series (Chen et al. Citation2022; Wang et al. Citation2021b; Wei et al. Citation2022). The FCN accepts images of varying sizes and generates dense pixel-level predictions by eliminating fully connected layers (Sun and Wang Citation2018; Yang et al. Citation2020). Following this, various derivative models have been proposed and effectively applied to diverse semantic segmentation tasks (Alam et al. Citation2021; Jiang et al. Citation2020; Ma and Chang Citation2022).

Owing to limitations in computational resources, existing architectures often employ downsampling operations, such as convolution with a stride of two or more and max-pooling, to achieve a balance between accuracy and efficiency. These operations effectively reduce the size of feature maps while broadening the receptive field, thus reducing the computational complexity of the model and preserving comprehensive semantic information. However, the repeated stacking of pooling or downsampling layers can lead to the loss of detailed object information and the absence of local structures, resulting in subpar segmentation at the edges of objects (Du et al. Citation2021; Guo, Xu, and Liu Citation2021; Wang et al. Citation2022). To address this issue, numerous methods have been proposed to mitigate the loss of local information by replacing strides with dilated convolutions, but this requires adding extra layers (Yu and Koltun Citation2016). Establishing semantic relationships between adjacent features in HSRSIs is more challenging than in natural images with simpler backgrounds.

Superpixels significantly reduce computational complexity (Ren and Malik Citation2003). They capture semantic relationships between pixels, including texture and topological relationships between image objects, rather than focusing on individual pixel spectral information. This results in more accurate contour boundaries within local regions. Thus, the use of superpixels is a potential strategy for addressing the issue of inaccurate boundary segmentation and inconsistent prediction of sub-regions within objects, particularly in the context of high-resolution remote sensing imagery. Mainstream superpixel algorithms include simple linear iterative clustering (SLIC) (Achanta et al. Citation2012), linear spectral clustering (Ren, Zhai, and Sun Citation2023), and entropy rate superpixel (Wang et al. Citation2023). Researchers have integrated superpixel techniques into deep neural networks, leading to the development of various remarkable algorithms (Kwak, Hong, and Han Citation2017). Gadde et al. (Citation2015) demonstrated that implementing superpixel-based downsampling without altering the initial propagation path of the architecture significantly improved model performance. Zhao et al. (Citation2018) proposed combining superpixels with conditional random fields to optimize the semantic segmentation performance of the FCN deep learning model. Furthermore, superpixels are utilized as a post-processing technique. Merging the edges of the superpixels with the pre-segmentation outcome results in enhanced segmentation results (Lei et al. Citation2022). Utilizing the adjacency relationship between superpixels enhances the abstract degree of features and facilitates the transfer and update of features across multiple layers. This enables a better understanding of the semantic consistency between targets.

In conclusion, semantic image segmentation through supervised network training with a large volume of labeled image data results in high labeling costs, while relying on a limited amount of labeled data leads to reduced network performance. The present study introduces an unsupervised semantic segmentation model that combines ImSE-Net with the SLICm superpixel. The ImSE-Net extracts semantic features from the image and integrates them with the SLICm superpixel segmentation algorithm to obtain a coarse segmentation of the image. This unsupervised model, augmented with high-level abstract semantic features, enhances semantic consistency across patches, improves segmentation accuracy for irregular boundaries, and generates final semantic segmentation results.

The key innovations of this segmentation framework are as follows:

Through the integration of the ImSE-Net model with the SLICm superpixel optimization model and the replacement of strided convolution processing with superpixel-based downsampling, finer predicted images are obtained (i.e. improved prediction accuracy of the encoder–decoder architecture).
High-level convolutional features provide accurate semantic information, while superpixel edge features provide accurate spatial information; thus, the problems of inaccurate boundary segmentation and inconsistent predictions within objects in HSRSI semantic segmentation with deep learning are addressed, resulting in improved segmentation accuracy.
An innovative unsupervised semantic segmentation approach based on HSRSIs is introduced. The approach integrates the latent spatial relationships between global objects (GHPsP) and excavates fine-grained local edge features (LHPsP), aiming to effectively enhance semantic consistency within the global scope.

2. Methods

This paper proposes an unsupervised semantic segmentation model that combines the ImSE-Net model with the SLICm superpixel model (). The upper half of the red box represents the ImSE-Net model extracting semantic features of the image to obtain rough clustering results. The lower half of the red box represents the superpixel optimization component, which conducts superpixel optimization processing on the input image, yielding preprocessing semantic segmentation results. The green box symbolizes the utilization of an unsupervised semantic segmentation model that integrates high-level abstract semantic features and detailed superpixel information. This integration aims to achieve high-precision segmentation results with enhanced regional semantic consistency.

Figure 1. Illustration of the workflow for the proposed model. The pre-segmentation phase integrates the ImSE-Net model with SLICm superpixel optimization, generating a preliminary semantic segmentation result denoted by the red dashed line above. The UGLS algorithm for further refinement classification results is indicated by the lower green dashed frame. (a) Original image; (b) segmentation result using ImSE-Net; (c) contour refinement based on SLICm; (d) segmentation result combining ImSE-Net and SLICm; (e) final segmentation result.

2.1. Implicit superpixel embedding network (ImSE-Net)

To effectively utilize the properties of superpixels and reduce the loss of ground information, the ImSE-Net method was adopted (Liu et al. Citation2019b). This technique employs superpixel-based downsampling and uses sampled pixels as seeds to group pixels. It decodes low-resolution images into high-resolution ones based on the generated superpixels. Consequently, superpixels are specifically utilized to restore lost resolution and preserve detailed ground information. The ImSE-Net method primarily involves data preprocessing and clustering ().

Figure 2. Illustration of the workflow for the proposed ImSE-Net method. FCN-32 uses ResNet-101 as the backbone, in combination with the proposed ImSE-Net model from Conv2_x to Conv5_x. This method groups pixels from the downsampling layer to create a allocation matrix assigning predicted target values to each pixel group. Unlike existing neural network methods, this method does not explicitly employ superpixel segmentation for downsampling. Hence, the method can be integrated into existing architectures without altering their feedforward paths.

2.1.1. Preparatory

According to , FCN consists of convolutional layers, ReLU activation layers, and downsampling layers, among which the max-pooling layer and the convolutional layer comprise at least two layers (Shelhamer, Long, and Darrell Citation2017). Assuming $g^{(s)}$ represents the Conv+ReLU module corresponding to FCN-32 architecture, where s denotes an output stride, and $d^{(s)}$ represents the downsampling layer. Then, let $d^{(s)} : R^{\frac{HW}{S^{2}} \times N} \to R^{\frac{HW}{S^{2}} \times M}$ , with $s < s^{'}$ , indicating a reduction in the feature image resolution. The downsampling layer reduces the resolution from $(\frac{H}{s}, \frac{W}{s}) to (\frac{H}{2 s}, \frac{W}{2 s})$ .

Therefore, the downsampling layer is defined as $d (\cdot)$ , where the feature dimension of the downsampling feature image changes when strided convolution is used for downsampling. The prediction image generated by FCN-32 is (1) $y^{(32)} = \emptyset^{\circ} g^{(32)} \circ d^{\circ} \dots^{\circ} g^{(2)} \circ d^{\circ} g^{(1)} (X),$ (1) where $X = [x_{1}, x_{2}, \dots, x_{n}] \in R^{H \times W}$ denotes the input data, with each image having a height H and width W; $\emptyset (\cdot)$ represents the output feature value of the mapping function, which is the direct mapping from the input image $X$ to $y^{(32)}$ .

2.1.2. Implicit clustering module

Grouping adjacent pixels in the downsampling layer and using all the pixels belonging to the same cluster as a common prediction can significantly reduce the model parameters and computational complexity while effectively preserving the spatial correlation of local pixels and improving the efficiency and accuracy of segmentation.

First, the representation of the j-th cluster with output stride s is given as (2) $c_{j}^{(s)} = {i | ∀k, S (W^{(s)} x_{i}^{(s)}, {\tilde{W}}^{(2 s)} s_{j}^{(2 s)}) \geq S (W^{(s)} x_{i}^{(s)}, {\tilde{W}}^{(2 s)} s_{k}^{(2 s)})},$ (2) where S is cosine similarity, and $s_{j}^{(2 s)}$ represents the j-th pixel in the downsampled feature image; $s^{(2 s)} = d (x^{(s)})$ . Therefore, $W^{(s)} \in R^{K \times N}$ and ${\tilde{W}}^{(2 s)} \in R^{K \times N}$ are learnable weight matrices for $x^{(s)} \in R^{\frac{HW}{s^{2}} \times N}$ and $s^{(2 s)} \in R^{\frac{HW}{4 s^{2}} \times M}$ , respectively, where feature vectors are mapped to a k-dimensional space; k refers to 64 in the experiment . Then, a set of pixels ${C^{(s)}}$ is generated at each resolution.

To effectively insert the clustering module into the downsampling layer, bilinear upsampling is used to decode the prediction back to the original resolution. The cluster to which the i-th pixel belongs can be obtained using $argma x_{k} S (W^{(s)} x_{i}^{(s)}, {\tilde{W}}^{(2 s)} s_{k}^{(2 s)})$ . To enable end-to-end training of the model, a temperature softmax function is applied. Compared with traditional clustering, this model allows for gradual optimization during the training process, making the probability of assigning each sample to different clusters more uniform, thereby preventing overfitting. The formula is as follows: (3) $A^{(s) *} = \arg max_{A^{(s)} \in {0, 1}^{U \times V}} \sum_{ij} A_{ij}^{(s)} S_{ij}^{(s)}, s . t ., \sum_{j} A_{ij}^{(s)} = 1.$ (3) Here, $A^{(s) *}$ and $S^{(s)} \in R^{U \times V}$ represent the allocation matrix and the similarity matrix, respectively, and $S_{ij}^{(s)} = S (W^{(s)} x_{i}^{(s)}, {\tilde{W}}^{(2 s)} s_{j}^{(2 s)})$ can be obtained.

U and V represent the number of data points in $x^{(s)}$ and $S^{(2 s)}$ , respectively. Then, utilizing the higher-level features learned during the clustering process allows for reconstructing low-resolution downsampled feature images into fine-resolution feature images, as shown below: (4) ${\tilde{x}}^{(s)} = A^{(s) *} x^{(2 s)} .$ (4) Therefore, the model predicts the target value $y^{(32)} = \emptyset (x^{(32)})$ from the low-resolution feature image, where $y^{(32)}$ represents the predicted target value. By recursively decoding the low-resolution prediction results, the original high-resolution feature image is obtained as shown below: (5) $y^{(1)} = \prod_{s}^{'} = {16, 8, 4, 2, 1} A^{(s^{\hat{'}}) *} y^{(32)} .$ (5) To further improve matrix $A^{(s) *}$ to a distribution matrix ${\tilde{A}}^{(s)} \in (0, 1)^{U \times V}$ , achieve a differentiable clustering process, and complete end-to-end training, each sampled clustering seed point is defined with a soft allocation matrix ${\tilde{A}}^{(s)}$ in the clustering process. ${\tilde{A}}^{(s)}$ contains the degree to which each input feature $x_{i}^{(s)}$ is assigned to each sampled clustering seed point $s_{j}^{(2 s)}$ , as shown below: (6) ${\tilde{A}}_{ij}^{(s)} = \frac{\exp (\frac{S (W^{(s)} x_{i}^{(s)}, {\tilde{W}}^{(2 s)} s_{j}^{(2 s)})}{τ})}{\sum_{k} \exp (\frac{S (W^{(s)} x_{i}^{(s)}, {\tilde{W}}^{(2 s)} s_{k}^{(2 s)})}{τ})},$ (6) where $τ$ is the temperature parameter that controls the scale of the clustering matrix ${\tilde{A}}^{(s)}$ . When $τ$ is high, the allocation is close to uniform, which prevents the model from falling into a local optima. When $τ$ is low, the model assigns higher values to the nearest cluster center, leading to stable model convergence. When $τ$ tends to 0, ${\tilde{A}}^{(s)}$ is equal to $A^{(s) *}$ . A is set to 0.07 for our experiment, and the specific steps are shown in below.

Table 1. Detailed process for the ImSE-Net.

Display Table

2.2. SLICm and superpixel optimization

Although ImSE-Net restores some detail by adding a decoding module, it still relies on two bilinear interpolation upsamplings to enhance the feature resolution, leading to less-than-ideal semantic segmentation along object edges. Considering that superpixels can protect object edges, this study optimizes semantic segmentation by combining high-level semantic features with superpixel object edge information.

Superpixels typically represent clusters of pixels sharing similar characteristics such as location, color, and texture. In this research, the Simple Linear Iterative Clustering (SLIC) algorithm (Achanta et al. Citation2012) is employed for superpixel segmentation of the input image. First, it converts the color image to the CIE-LAB color space, where each pixel is linked to a five-dimensional vector composed of color values (L, a, b) and coordinates (x, y). Subsequently, a distance metric is constructed for these vectors, and then iterative local clustering is applied to the image pixels. The specific steps and formula for the calculations are outlined below:

Initialization of seed points. It is assumed that there are N pixels in the image, and the SLIC algorithm divides the image into K superpixels with the same size. The size of each superpixel is represented by N/K. The distance between seed points can be expressed as $S = \sqrt{N / K}$ . To avoid affecting the subsequent clustering results, the seed points are moved within a $3 \times 3$ neighborhood window, while the gradient values of all pixels are calculated. The seed points are moved to the position with the smallest gradient value to prevent the seed points from being assigned to the edge or noise pixels of the image. A label is assigned to each seed point.
Similarity measurement. The method involves searching for each pixel, calculating the similarity degree between the pixel and the seed point (including color distance and spatial distance), and iterating continuously until convergence: (7) $d_{lab} = \sqrt{{(l_{k} - l_{i})}^{2} + {(a_{k} - a_{i})}^{2} + {(b_{k} - b_{i})}^{2}},$ (7) (8) $d_{xy} = \sqrt{{(x_{k} - x_{i})}^{2} + {(y_{k} - y_{i})}^{2},}$ (8) (9) $D = \sqrt{{(\frac{d_{lab}}{m})}^{2} + {(\frac{d_{xy}}{S})}^{2} .}$ (9)

Here,

[l_{k}, a_{k}, b_{k}, x_{k}, y_{k}]

represents the five-dimensional feature vector of the seed point;

[l_{i}, a_{i}, b_{i}, x_{i}, y_{i}]

represents the feature vector of the pixel point to be judged; k denotes the seed point; i refers to the pixel point to be searched in the image;

d_{lab}

refers to the color similarity between pixel points and

d_{xy}

is the spatial distance between pixel points in the neighborhood of the image; m is used to measure the relative weight of color information and spatial information, taking the fixed constant in the experiment; D represents the similarity magnitude between two pixel points: the higher the similarity, the larger the value of D. To speed up the convergence of the algorithm, SLIC searches within the range of 2S×2S.

According to the obtained SLIC superpixel results, the SLICm optimization algorithm is formulated by combining high-level semantic features and superpixel object edge information to enhance the semantic segmentation results. First, the preliminary semantic segmentation result from the ImSE-Net output is obtained. Subsequently, the count of pixels occupied by each semantic category within each superpixel is determined. Finally, the semantic category with the highest total pixel count is selected and allocated to the respective superpixel. Detailed steps are provided in below.

Table 2. Detailed process for superpixel refinement.

Display Table

2.3. Unsupervised semantic segmentation model UGLS

The unsupervised semantic segmentation model integrates the over-segmented images generated by the model in Section 2.2. The global hidden pseudo-positive samples (GHPsP) aim to diminish the similarity matrix and alleviate data redundancy by considering diverse global semantic information. The local hidden pseudo-positive samples (LHPsP) achieve fine-tuning of semantic similarity in locally unbalanced areas, complete the overall optimization of local and global consistency, and obtain the final image segmentation ().

Figure 3. Illustration of the global hidden pseudo-positives and local hidden pseudo-positive selection process. GHPsp: unlabeled image samples $x_{i}^{'}$ ; feature extractor F; task-agnostic reference pool $Q^{ag}$ ; anchor features $f_{i}$ ; the segmentation head S produces corresponding segmentation features $s_{i} = S (f_{i})$ ; an index set of $P_{i}^{ag}$ for each i-th anchor feature $f_{i}$ ; the momentum segmentation head $S^{'}$ produces corresponding segmentation features $s_{i}^{'} = S^{'} (f_{i})$ ; task-specific reference pool $Q^{sg}$ ; an index set of $P_{i}^{sp}$ by comparing $s_{i}^{'}$ and $Q^{sg}$ ; the projection head Z produces a projection anchor vector $z_{i}$ ; GHPsP contrastive loss for i-th patch in unsupervised semantic segmentation with multiple positives $L_{ag}^{cont}$ and $L_{sp}^{cont}$ ; LHPsp: an index set $M_{i}^{nei}$ that contains the i-th anchor and its neighboring patches; it uses the average attention score value of ${\tilde{T}}_{i}$ as the threshold for selecting LHPsP $M_{i}^{loc}$ among $M_{i}^{nei}$ and then takes the above-average portion as the final experiment selection; the calculated gradient (G) for mixed feature $s_{i}^{mix}$ combines the neighboring positive features $G_{i}$ with the corresponding attention scores $T_{i}$ proportionally; the LHPsP objective functions $Z (s_{i}^{mix})$ produces a projected mixed vector $z_{i}^{mix}$ ; the global cost is Lc, and the local loss is L_r.

2.3.1. Global hidden pseudo-positives GHPsP

To uncover hidden positive pseudo-samples, enhance the semantic consistency of pixels, and assist in optimizing the contrastive loss function of unsupervised semantic segmentation, this paper proposes a GHPsP module that mainly includes task-agnostic GHPsP $P_{i}^{ag}$ and task-specific GHPsP $P_{i}^{sp}$ .

Given a mini-batch $X_{i}^{\hat{'}} = {{\overset{\hat{'}}{x}}_{i}}_{i = 1}^{E}$ including E unlabeled image samples, $x_{i}^{\hat{'}}$ obtains anchor features $f_{i} \in ℜ^{H \times W \times C}$ by processing them through a feature extractor F. For each image, a random feature $f_{i}$ is sampled to construct a task-agnostic reference pool $Q^{ag} = {q_{d}}_{d = 1}^{D}$ , ensuring the universality of the reference pool and increasing the comparability of the model.

We can calculate the similarity $b_{i}$ of each pixel feature $f_{i}$ to $Q^{ag}$ inside and collect hidden positive examples, using the following formula: (10) $b_{i} = max_{q_{d} \in Q^{ag}} sim (q_{d}, f_{i}),$ (10)

where D represents D random features from F, and $sim (\cdot, \cdot)$ represents cosine similarity between two vectors. If the similarity between $f_{i}$ and $f_{j}$ is greater than $b_{i}$ , then other features in the mini-batch $f_{j}$ are considered positive samples.

To ensure consistency in training and avoid ambiguity in the relationship between two patch features, a reference pool with distribution awareness is implemented to enable the discovery of globally similar features by considering each anchor point. (11) $P_{i}^{ag} = {j | sim (f_{i}, f_{j}) > b_{i} \lor sim (f_{i}, f_{j}) > b_{j}},$ (11) where j indicates the index for different patch features in the mini-batch processing.

The randomness of the reference pool $Q^{ag}$ accounts for the diversity of global semantic information but lacks task specificity. Therefore, we propose sampling the features of the momentum segmentation head $S^{'}$ and constructing additional GHPsP $P_{i}^{sp}$ to obtain a more comprehensive and representative set of positive samples, thus enhancing the robustness of the network. According to Eq. (11), (12) $P_{i}^{sp} = {j | sim (s_{i}^{'}, s_{j}^{'}) > b_{i}^{'} \lor sim (s_{i}^{'}, s_{j}^{'}) > b_{j}^{'}},$ (12) where $s^{'} = S^{'} (f)$ and task-specific reference pool $Q^{sg}$ comprises s’. Unlike the fixed samples generated by the pretrained backbone network $Q^{ag}$ , $Q^{sg}$ is updated synchronously with S’ during the training period.

β% negative sample set for each i-th anchor $H_{i}$ is randomly selected from the remaining patches in our experiment. Therefore, $H_{i}^{ag}$ and $H_{i}^{sg}$ correspond to $P_{i}^{ag}$ and $P_{i}^{sg}$ , and the loss function can be defined as follows: (13) $L^{cont} (z_{i}, P, H) = - \frac{1}{| P |} \sum_{p \in P} \log \frac{\exp (sim (z_{i}, z_{p}) / τ)}{\sum_{h \in (H \cup P)} \exp (sim (z_{i}, z_{h}) / τ)},$ (13) where $z_{i} = Z (S (f_{i}))$ represents the projected anchor vector, P represents the positive index set, and H represents the negative index set.

2.3.2. Local hidden pseudo-positives LHPsP

While adjacent pixels might share the same semantic label, the segmentation boundary often exhibits substantial variability in semantic information, posing challenges for achieving semantic consistency within the local area. Inspired by the capacity of vision transformers (ViTs) (Dosovitskiy et al. Citation2020) in modeling long-range dependencies and global context, we propose leveraging the backbone F of ViT to construct the local matching set LHPsP by selecting regions characterized by high attention scores. The feature vector is used as a weight to propagate the gradient correctly to the LHPsP and achieve semantic similarity fine-tuning of locally imbalanced regions. This approach facilitates comprehensive optimization of both local and global consistency, significantly enhancing the model's segmentation performance.

Given an index set $M_{i}^{nei}$ that contains the i-th anchor and its neighboring patches, and the spatial attention score ${\tilde{T}}_{i} \in R^{HW}$ obtained from the output of the last self-attention layer of backbone F, our method uses the average value of ${\tilde{T}}_{i}$ as the threshold for selecting LHPsP $M_{i}^{loc}$ among $M_{i}^{nei}$ : (14) $M_{i}^{loc} = {j | j \in M_{i}^{nei} \land t_{j} > Avg ({\tilde{T}}_{i})},$ (14) where $t_{j}$ represents the j-th element of ${\tilde{T}}_{i}$ , and $Avg (\cdot)$ represents the mean function. The neighboring positive features $G_{i}$ and their corresponding attention scores $T_{i}$ are obtained using Eq. (14): (15) $G_{i} = {s_{j} | j \in M_{i}^{loc}},$ (15) (16) $T_{i} = {t_{j} | j \in M_{i}^{loc}},$ (16) where $s_{j}$ is the feature of the j-th patch extracted by the momentum segmentation head S. Mixing the patch features from $G_{i}$ with the corresponding attention scores from $T_{i}$ in proportion, we obtain: (17) $s_{i}^{mix} = \frac{\sum_{j \in M_{i}^{loc}} σ g_{j} t_{j}}{| M_{i}^{loc} |}, z_{i}^{mix} = Z (s_{i}^{mix}),$ (17) where $s_{i}^{mix}$ is a mixed patch; $z_{i}^{mix}$ is a projected mixed vector; $σ$ is a scalar value that measures the attention score and scales the attention scores to an appropriate range to ensure the stability and correctness of gradient propagation; $g_{j}$ indicates the j-th element of $G_{i}$ .

2.3.3. Loss function

The cost function of this article consists of two parts: global cost and local cost. The global cost is represented by Eq. (18). (18) $L_{c} = \sum_{i} \sum_{k} {\tilde{Q}}_{ik} \log {\tilde{Q}}_{ik} - {\tilde{Q}}_{ik} \log {\tilde{P}}_{ik} + l (P),$ (18) where $l (\cdot)$ is the regularization term, $\tilde{P}$ is the probability that the pixels are assigned to seed points, $i$ represents a pixel point, and k represents a seed point. The local loss $L_{r}$ is a key means for adjusting the weight parameters according to the gradient reconstruction. It encompasses the reconstruction of the color and spatial features of the input image and is defined as (19) $L_{r} = L_{r_{c}} + \emptyset * L_{r_{s}},$ (19) where $L_{rc}$ is the reconstruction loss of color features, $L_{rs}$ is the local loss of spatial features, and $\emptyset$ controls the weights of $L_{rc}^{i}$ and $L_{rs}^{i}$ . However, as the gradient is reconstructed as a bidirectional gradient, $L_{r}$ can be written as (20) $L_{r} = \sum_{i \notin V_{b}} (L_{r_{c}}^{i} + \emptyset * L_{r_{s}}^{i}) + \sum_{i \in V_{b}} (L_{r_{c}}^{i} + B_{i} * \emptyset * L_{r_{c}}^{i}),$ (20) where $V_{b} = {n | B_{n} > ϵ}$ is the number of pixel sets, and $\sum_{i \in V_{b}} (B_{i} * \emptyset {* L}_{r_{s}}^{i})$ is a regularization term to avoid over-reliance on the spatial characteristics of pixels in $V_{b}$ during the clustering process. Finally, the loss functions in the proposed method are (21) $L = L_{c} + β * L_{r},$ (21) where $β$ balances the two loss functions.

The steps of the unsupervised semantic segmentation model are presented in below.

Table 3. Detailed process for unsupervised semantic segmentation UGLS.

Display Table

3. Experiments and discussion

3.1. Datasets

To assess the effectiveness of the proposed method, we selected three widely used HSRSIs categorized into three groups, totaling 12 scenes. These images primarily encompassed the COCO-stuff dataset (Caesar, Uijlings, and Ferrari Citation2018), Jilin-1 dataset (JL-1), and Beijing-2 dataset (BJ-2). A detailed breakdown of the images is provided in . The dataset consisted of 12,000 images designated for semantic segmentation. For the experiment, 80% of the data was allocated to the training set, and the remaining 20% served as the validation set. Ground truths corresponding to the original remote sensing images were color-assigned for each cluster, facilitating visual comprehension through color and name matches.

Table 4. Grouping information about datasets.

Download CSV Display Table

The COCO-stuff dataset is widely employed in experimental tasks such as object detection and image semantic segmentation owing to its extensive categories and finely labeled data. We selected nine sub-categories from this dataset, excluding the background category ().

Table 5. Test set task categories.

Download CSV Display Table

JL-1 originates from the China Center for Resource Satellite Data and Applications (CRESDA). Regarded as China’s inaugural commercial satellite, JL-1 features a high spatial resolution and sensitivity, enabling precise ground imaging. The dataset includes nine sub-categories, excluding the background category ().

The BJ-2 constellation system originated from China’s national ‘15th Five-Year Plan’ and the ‘863’ program, and it is focused on addressing significant scientific and technological challenges. It is China’s first remote sensing satellite constellation project established through a market-based approach. The project promotes institutional and technological innovation within China’s aerospace industry, facilitating market resource allocation. The dataset comprises nine sub-categories, excluding the background category ().

3.2. Evaluation accuracy

Qualitative and quantitative methods are used to evaluate the performances of the proposed model and the baseline algorithms. Boundary adhesion, shape heterogeneity, and the segmentation accuracy of different feature categories are compared in the qualitative analysis. In the quantitative analysis, the algorithms are compared in terms of the F1-score, OA, MIoU, ACC, and NMI indicators ().

Table 6. Quantitative evaluation indicators.

Display Table

Here, we compare the model results with the actual data to obtain the confusion matrix $P = {p_{ij}} \in N^{k \times k}$ of the result; $p_{ii}$ denotes the number of pixels of category i predicted as category i; $p_{ji}$ represents the number of pixels belonging to category j predicted as category i; $p_{ij}$ indicates the number of pixels from class i predicted as class j; k denotes the number of categories. Furthemore, $\Pr$ , Re and $k_{i}$ represent the segmentation accuracy, the boundary recall rate and the clustering assignment of data, respectively (Wang et al. Citation2021c).

F₁-score is a common evaluation metric applied to semantic segmentation (Hou et al. Citation2020); OA is the proportion of the number of correctly classified pixels to the total number of pixels (Zhao et al. Citation2022); MIou refers to the average intersection over union of each category (Zhao et al. Citation2022); ACC is the accuracy; and NMI is the normalized mutual information, where K is the cluster distribution, Y is the ground truth, and H() is the entropy. When distribution K is similar to Y, the NMI value approaches 1; otherwise, the NMI value is close to 0 (Zhao et al. Citation2018).

3.3. Analysis of results

Comparison with Baseline Algorithms: K-means+ (Pedregosa et al. Citation2011), SLIC based on K-means classification (SLIC*) (Achanta et al. Citation2012), automatic fuzzy clustering framework based on K-means classification (AFCF*) (Lei et al. Citation2020), sobel adaptive morphological reconstruction and watershed algorithm based on K-means classification (SOAWS*) (Jia et al. Citation2020), unsupervised semantic segmentation using self-learning superpixel network (SLSP-Net*) (Yang et al. Citation2022), ImSE-Net+SLICm, unsupervised semantic segmentation using invariance and equivariance in clustering (PiCIE) (Cho et al. Citation2021), unsupervised semantic segmentation by feature correspondences (STEGO) (Hamilton et al. Citation2022). Detailed information is provided in below, where the first row denotes the method, and the second row lists the author and publication date.

Table 7. Description of the unsupervised semantic segmentation algorithms.

Download CSV Display Table

Legend: +Method based on K-means. *Method does not directly learn a classification function and requires further application of the K-means classifier for unsupervised image semantic segmentation.

The experiments were conducted on a Windows 10 platform featuring an Intel(R) Core(TM) i7-10700 2.9 GHz CPU and 16 GB RAM. All algorithms and evaluations were implemented using PyTorch 1.1, Python 3.8, CUDA 11.3, and Matlab 2020a. We initialize the unsupervised semantic segmentation using the ViT models pretrained by ImageNet for fine tuning and employ extended convolution to extract dense image semantic features. Stochastic gradient descent was used for model training, with an initial learning rate set to 0.0005 and a weight decay of 0.1. Additionally, following (Liu, Rabinovich, and Berg Citation2015), we adopted the ‘poly’ learning rate policy for decay, in which the initial learning rate is multiplied by $1 - {(\frac{iter}{\max_iter})}^{0.9}$ (Chen et al. Citation2018). Data augmentation involved random scaling of input images with a ratio of 0.5–2 and flipping during training.

3.3.1. Segmentation test results

The recommended methods were employed to perform segmentation on the COCO-stuff test set and our two original image test tasks. Experimental results are grouped and visualized in , while corresponding quantified results are presented in .

Figure 4. Example results with the COCO-stuff test set.

Figure 5. Example results on the build test set.

Figure 6. Example results on the mixed region test set.

Table 8. Per-class result on the COCO-stuff test set.

Download CSV Display Table

Table 9. Evaluation scores (%) of different baseline methods based on the COCO-stuff test set.

Download CSV Display Table

Table 10. Comparison between baseline and state-of-the-art methods for experiment 2. The best values are highlighted in bold.

Download CSV Display Table

Table 11. Per-class result for experiment 3.

Download CSV Display Table

Table 12. Comparison between baseline and state-of-the-art methods for experiment 3.

Download CSV Display Table

3.3.1.1. Test results for experiment 1

Qualitative Evaluation

displays qualitative segmentation examples of the first task set achieved by our method, comparing it with state-of-the-art competitors. The images show representative scenes from natural landscapes based on COCO-stuff, featuring simple object types but exhibiting diverse and rich content. The clear color and texture of the terrain make it susceptible to confusion with surrounding objects.

Considering the segmentation results for . I1 and . I2, K-means+ and SLIC* tend to overly emphasize intricate details and noise from the original images, resulting in inadequate boundary adhesion, as illustrated in . Conversely, SOAWS*, AFCF*, SLSP-Net*, ImSE-Net+SLICm, PiCIE, and STEGO exhibit superior performance in terms of the shape and integrity of segmentation. However, these methods encounter challenges in addressing over-segmentation, leading to further fragmentation in certain regions. Our proposed method shows superior performance in preserving object boundary information, particularly in separating small blocks, such as persons, chairs, and grass. It consistently achieves accurate segmentation of larger objects such as buildings.

According to . I3 and . I4, K-means+, SLIC*, SOAWS*, and AFCF* methods generate segmented regions with varied sizes and shapes. Although these methods align the segmentation boundaries well with the main buildings, the overall segmentation accuracy remains comparatively inadequate. Additionally, SOAWS*, AFCF*, SLSP-Net*, ImSE-Net+SLICm, PiCIE, and STEGO outperform other methods in terms of time efficiency. However, they suffer from severe under-segmentation, irregular rectangular shapes, and, in some instances, incorrectly merged fuzzy boundaries. Our proposed method significantly enhances the model’s ability to make explicit predictions in object classification and performs exceptionally well in capturing boundary details. For instance, in . I3, the segmentation distinctly separates the sky and ocean, displaying high completeness and evident boundaries between categories, effectively overcoming the effects of spectral similarity. Although there are instances of misclassification, such as trees being mistakenly grouped with buildings, our method provides the closest segmentation to the actual results overall.

(2)

Quantitative Evaluation

The image segmentation results from experiment 1 were accurately evaluated using metrics such as F1-score, OA, MIoU, ACC, and NMI ( and ). The best values are highlighted in bold.

According to and , K-means+ and SLIC* result in severe over-segmentation and exhibit relatively low evaluation metrics. SOAWS* and AFCF* incorrectly merge building and vegetative regions, particularly with minimal improvement in excessive segmentation of vegetation. SLSP-Net*, ImSE-Net+SLICm, PiCIE, and STEGO present numerous noisy points in building segmentation, with noticeable deviations at terrain segmentation boundaries. The proposed method outperforms the other algorithms in terms of the experimental metrics and demonstrates superiority in terrain segmentation.

3.3.1.2. Test results for experiment 2

Qualitative Evaluation

displays sample images, corresponding label maps, and qualitative classification results of the second task set obtained by our method, juxtaposed with state-of-the-art competitors. This dataset, sourced from Jl-1 and BJ-1, represents a densely built-up area with low intra-class variance. The rich roof color and texture structure pose challenges in differentiating them from the surrounding terrain, thereby increasing segmentation difficulty.

According to the images in . I5 and . I6, a large number of noisy points are generated when K-means+, SLIC*, and SOAWS* are used to segment vegetation and buildings, resulting in a deterioration of the overall segmentation effect. AFCF*, SLSP-Net*, PiCIE, ImSE-Net +SLICm and STEGO fragment the vegetation and road regions, displaying irregular rectangles in segmentation, with some regions showing broken segments. Moreover, some dense areas are incorrectly merged. Our method demonstrates improvements in addressing over-segmentation, resulting in more consistent sizes and shapes of segmented objects with the reference image. However, some mis-segmentation persists, notably in certain areas such as roads and vegetation. Considering the segmentation results for . I7 and . I8, K-means+ and SLIC* exhibit strong adherence to terrain but suffer from severe over-segmentation. Owing to spectral similarities between building boundaries and roads, SOAWS*, AFCF*, SLSP-Net*, and ImSE-Net+SLICm show evident mis-segmentation, leading to reduced accuracy. PiCIE and STEGO display significant under-segmentation and excessive merging of superpixel regions. Although the proposed method encounters errors in merging, it improves segmentation fragmentation and aligns more consistently with the visual segmentation effect.

(2)

Quantitative Evaluation

The best values are in bold. The results of image segmentation in experiment 2 were evaluated in terms of F1-score, OA, MIoU, ACC, and NMI, as shown in .

Compared with other algorithms, the proposed algorithm exhibits superior segmentation performance for detailed areas such as object edges owing to its integration of superpixel information (). Overall, our method exhibits superior performance across multiple indicators (OA = 75.38, MIoU = 35.26, ACC = 75.47, NMI = 75.58), indicating that the improvement was brought by the proposed hidden positive pseudo-samples based on the unsupervised semantic segmentation model.

3.3.1.3. Test results for experiment 3

Qualitative Evaluation

As shown in , the third dataset is a mixed land area, with more chaotic spatial relationships and complex adjacent relationships among objects than the datasets used in experiments 1 and 2. The sizes and scales of objects are also diverse, making segmentation more difficult.

According to the images in . I9 and . I10, the K-means+, SLIC*, SOAWS*, AFCF*, SLSP-Net*, ImSE-Net+SLICm, PiCIE, and STEGO algorithms successfully segment water and tree areas, but the segmentation results appear fragmented and irregular, with some areas being notably more fragmented than others. Additionally, there are instances of incorrect merging in dense areas. Our method addresses over-segmentation, with sizes and shapes of segmented objects that are more consistent with the reference image. However, some errors in object segmentation persist in locations such as water and plants.

. I11 and . I12 present diverse mixed land areas encompassing water, low vegetation, farmland, bare ground, roads, and buildings, forming a more complex scene. Our method outperforms K-means+, SLIC*, SOAWS*, AFCF*, PiCIE, SLSP-Net*, ImSE-Net+SLICm, and STEGO algorithms in separating buildings from adjacent water fields and preserving the integrity of greenhouse areas. However, our method exhibits sensitivity to spectral features of plants, resulting in a few misplaced small segments and misclassification of farmland as vegetation.

(2)

Quantitative Evaluation

Comparison of five indicators (F1-score, OA, MIoU, ACC, and NMI) for experiment 3. The best values are in bold, as shown in and .

As observed in and , when there is a simultaneous increase in map width, type of ground object, spatial distribution, image resolution, and contour complexity, both the K-means+ and SLIC* segmentation methods show relatively low segmentation accuracy. The segmentation results of SOAWS*, AFCF*, SLSP-Net*, ImSE-Net+SLICm, PiCIE, and STEGO indicate that while the accuracy of these algorithms is generally impacted, the convergence and stability differ considerably owing to the distinct principles underlying each algorithm. Notably, the deep fuzzy clustering models such as ImSE-Net+SLICm, STEGO, and our model outperform the shallow models, highlighting the advantage of leveraging deep learning techniques for acquiring clustering-friendly representations. Moreover, our proposed model shows improvements in OA, MIoU, ACC, and NMI values by 0.42%, 4.30%, 2.03%, and 2.06%, respectively, compared with STEGO. These results emphasize that uncovering hidden pseudo-samples enhances the semantic consistency of pixels, thereby enhancing the clustering performance of deep fuzzy models.

In summary, the evaluation indicators across the three sets of experimental image collections decrease as the difficulty of high-resolution remote sensing image segmentation increases. Nevertheless, the proposed method consistently demonstrates a higher evaluation value than the other algorithms. According to this analysis, the visual effect of the segmented image is consistent with the intuitive conclusions drawn from the experimental indicators, and the proposed method achieves the highest segmentation accuracy among the tested methods.

We have demonstrated the capability of our approach to effectively manage unbalanced areas, specifically object boundaries with significant variability in semantic information, yielding more accurate contours and superior segmentation outcomes marked by heightened semantic consistency. However, several aspects within our proposed framework remain unexplored and underdeveloped. Although our work enhances the prediction accuracy of encoder–decoder architectures by integrating the proposed new approach and replacing downsampling in the encoder with our superpixel-based downsampling to group pixels, we still face challenges in eliminating errors in local semantic information. These errors become more pronounced when the source and target domains have different data imaging modes. Additionally, while our method performs segmentation on the COCO-stuff test set and our two test set tasks with original images, the effectiveness of our approach on other datasets remains a question. In the follow-up research, we aim to expand the data sources and scale to further support the investigation of semantic segmentation model structures for remote sensing images. Furthermore, we plan to incorporate data into the model coding layer, leveraging prior knowledge to enhance the model’s feature learning ability and ultimately improve recognition accuracy.

3.3.1.4. Ablation study

In the ablation study, inspired by the accomplishments of the data decomposition model (Feng et al. Citation2020), we considered three test aspects to evaluate our approach:

Effectiveness of embedding clustering model: We compared the effectiveness of clustering model connections among ResNet-18, ResNet-34, ResNet-50, and ResNet-101 backbones of FCN-32 to determine the optimal combination for ImSE-Net ().
Sensitivity of hierarchical levels: We compared the clustering accuracy of the backbone network when combined with ImSE-Net at different hierarchical levels, including Conv2_x, Conv3_x, Conv4_x, and Conv5_x, to identify the sensitive region within the hierarchy ().
Influence of hidden pseudo-samples model: We evaluated the contribution of different components of UGLS with various combinations to examine the improvement brought by the proposed global–local hidden pseudo-samples model ().
Precision analysis of different stages: We compared the effectiveness of the model links in stage 1 (ImSE-Net+SLICm), stage 2 (UGLS), and the proposed method in different stages to determine the optimality of the model stage links in the proposed method ().

Here, ‘Im’ refers to the ImSE-Net model, which replaces the striding of Conv4_x and Conv5_x in the backbone of FCN-32. In this ablation study, ImSE-Net achieves its best merging capability when ResNet-101 is used as the backbone network, but the test results for ResNet-34 and ResNet-50 are close. This illustrates that as the number of convolutional layers increases, the network becomes more adept at handling complex features and larger datasets, resulting in higher accuracy and improved generalization ability. Furthermore, our proposed ImSE-Net model demonstrates excellent robustness, enabling seamless integration into existing architectures without altering their feedforward paths.

We only present the performance of ResNet-101 and ResNet-18 because the performance of ResNet-34 and ResNet-50 falls between those of these two models, as confirmed by . Considering the clustering effectiveness of the backbone shown in , the performance consistently improves as the hierarchical levels increase.

Comparisons of (b) and (a) as well as (f) and (e) indicate that the most significant performance enhancements occur when ImSE-Net is incorporated at hierarchical levels, progressing from 1 (Conv5_x) to 2 (Conv4_x and Conv5_x). When the hierarchical levels are further expanded to 3 (Conv3_x, Conv4_x, and Conv5_x), as in (g) and (c), the algorithm presents varying degrees of fluctuations, resulting in a slight increase (b) and a significant decrease (f). However, embedding the model with four levels results in a sharp drop for ResNet-18 (h) compared with (c), while ResNet-101 (h) experiences a significant boost, reaching its peak compared with (g) and outperforming all other algorithms, which demonstrates that the improvement is brought by the proposed implicit clustering module.

Here, IS, GHPsP, LHPsP, TAs, TSs, and CT denote ImSE-Net+SLICm, global hidden pseudo-positive, local hidden pseudo-positive, task-agnostic training, task-specific training, and consistency assignment, respectively.

According to , compared with (a) in which both the task-specific GHPsP and LHPsP are not used, employing each of them leads to improvements in OA (+2.11%), MIoU (+1.23%), ACC (+2.22%), and NMI (+2.22%) for (b), and OA (+3.36%), MIoU (+1.86%), ACC (+3.10%), and NMI (+3.14%) for (c). These improvements demonstrate that the introduction of both task-specific GHPsP and LHPsP enhances the semantic consistency of pixels and contributes to the optimization of the classification accuracy of unsupervised semantic segmentation. Additionally, a comparison of (d) and (e) demonstrates the significance of maintaining consistent assignments in selecting GHPsP. The comparison shows advancements in OA (+4.13%), MIoU (+2.86%), ACC (+3.51%), and NMI (+4.24%). Overall, the combination of GHPsP and LHPsP offers a significant advantage over the baseline and some popular deep clustering methods () in accurately segmenting ground objects and different regions with various objects.

show that stage 1 and stage 2 are independent and significantly different from each other, and both have a beneficial effect on the performance of the algorithm in this paper.

The stage 2 further improves the clustering accuracy of stage 1 on OA (+7.49%), MIoU (+1.37%), ACC (+6.93%), NMI (+5.89%) and Time (+533s), which demonstrates that although excavating hidden positives to learn rich semantic relationships and ensure semantic consistency in local regions is beneficial for segmentation tasks, it takes more time to implement. Note that the overall segmentation accuracy improves progressively with the gradual addition of stages. The performance of Ours is better than that of stage 2 (OA: +2.13%; MIoU: +2.23%; ACC: +1.58%; NMI: +3.40%; Time: +4526s), showing the importance of the proposed stage 1. In summary, our proposed method achieves the best segmentation capacity, as it combines both algorithms and consistently. By testing them all (and more) in , the results justify our choices.

4. Conclusion

To address inaccuracies in boundary segmentation and inconsistent sub-area predictions for ground objects in the deep learning-based semantic segmentation of HSRSIs, the approach integrates the ImSE-Net model with the SLICm superpixel model, merging high-level abstract semantic features with object edge details from superpixels to generate semantic segmentation results. This method aids in recovering image details lost during the pooling or downsampling processes, particularly in capturing object edges. Moreover, the method leverages an unsupervised semantic segmentation model and incorporates both global and local implicit regularization methods to learn rich semantic information with local consistency and effectively merge over-segmented regions to obtain final segmentation results. However, while our method shows improvements, the increase in accuracy is moderate. This network optimization algorithm only adjusts certain function modules to ensure that the models are available for task-specific segmentation. Therefore, establishing a robust migration model for different data sets is a challenge. Additionally, the underlying scene structure extracted by relationship and interaction modeling is an important research direction to promote the accuracy of image semantic segmentation in the future. Although researchers have been trying to go deep to the relational modeling, this research is still in the preliminary stage.

Author contributions

Acquisition of the financial support for the project leading to this publication, H.N. and H.L. Application of statistical, mathematical, computational, or other formal techniques to analyze or synthesize study data, Z.Y., X.W. and K.Y. Preparation, creation, and/or presentation of the published work by those from the original research group, specifically critical review, commentary, or revision, including pre- or post-publication stages, H.N. and Z.Y.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The code used in this study are available by contacting the corresponding author.

Additional information

Funding

This research was funded by Scientific and Technological Innovation Team of Universities in Henan Province, grant number 22IRTSTHN008.

References

Achanta, Radhakrishna, Appu Shaji, Kevin Smith, Aurelie Lucchi, and Pascal Fua. 2012. “SLIC Superpixels Compared to State-of-the-Art Superpixel Methods.” IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (11): 2274–2281. https://doi.org/10.1109/TPAMI.2012.120.
PubMed Web of Science ®Google Scholar
Ahmadi, Salman, M. J. Valadan Zoej, Hamid Ebadi, Hamid Abrishami Moghaddam, and Ali Mohammadzadeh. 2010. “Automatic Urban Building Boundary Extraction from High Resolution Aerial Images Using an Innovative Model of Active Contours.” International Journal of Applied Earth Observation and Geoinformation 12 (3): 150–157. https://doi.org/10.1016/j.jag.2010.02.001.
Web of Science ®Google Scholar
Alam, Muhammad, Jian-Feng Wang, Cong Guangpei, L. V. Yunrong, and Yuanfang Chen. 2021. “Convolutional Neural Network for the Semantic Segmentation of Remote Sensing Images.” Mobile Networks and Applications 26 (1): 200–215. https://doi.org/10.1007/s11036-020-01703-3.
Web of Science ®Google Scholar
Baatz, M., and A. Schape. 2000. “Multiresolution Segmentation: An Optimization Approach for High Quality Multi-Scale Image Segmentation.” Proceedings of the Beiträge zum AGIT-Symposium, 12–23.
Google Scholar
Caesar, H., J. Uijlings, and V. Ferrari. 2018. “COCO-Stuff: Thing and Stuff Classes in Context.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1209–1218.
Google Scholar
Chen, Yanlin, Guojin He, Ranyu Yin, Kaiyuan Zheng, and Guizhou Wang. 2022. “Comparative Study of Marine Ranching Recognition in Multi-Temporal High-Resolution Remote Sensing Images Based on DeepLab-V3+ and U-Net.” Remote Sensing 14 (22): 5654. https://doi.org/10.3390/rs14225654.
Web of Science ®Google Scholar
Chen, Liangchieh, Yukun Zhu, George Papandreou, Florian Schrof, and Hartwig Adam. 2018. “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation.” Proceedings of the European Conference on Computer Vision, 801–818.
Google Scholar
Cho, Jang Hyun, Utkarsh Mall, Kavita Bala, and Bharath Hariharan. 2021. “PiCIE: Unsupervised Semantic Segmentation Using Invariance and Equivariance in Clustering.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16794–16804.
Google Scholar
Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, and Neil Houlsby. 2020. “An Image Is Worth 16(16 Words: Transformers for Image Recognition at Scale.” Proceedings of the International Conference on Learning Representations.
Google Scholar
Du, Shouji, Shihong Du, Bo Liu, and Xiuyuan Zhang. 2021. “Incorporating DeepLabv3+ and Object-Based Image Analysis for Semantic Segmentation of Very High Resolution Remote Sensing Images.” International Journal of Digital Earth 14 (3): 357–378. https://doi.org/10.1080/17538947.2020.1831087.
Web of Science ®Google Scholar
Feng, Qiying, Long Chen, C. L. Philip Chen, and Li Guo. 2020. “Deep Fuzzy Clustering – A Representation Learning Approach.” IEEE Transactions on Fuzzy Systems 28 (7): 1420–1433. https://doi.org/10.1109/TFUZZ.2020.2966173.
Web of Science ®Google Scholar
Gadde, Raghudeep, Varun Jampani, Martin Kiefel, Daniel Kappler, and Peter V. Gehler. 2015. Superpixel Convolutional Networks Using Bilateral Inceptions. Cham: Springer.
Google Scholar
Guo, Z. C., J. M. Xu, and A. D. Liu. 2021. “Remote Sensing Image Semantic Segmentation Method Based on Improved Deeplabv3+.” Proceedings of the International Conference on Image Processing and Intelligent Control 11928:101–109.
Google Scholar
Hamilton, Mark, Zhoutong Zhang, Bharath Hariharan, Noah Snavely, and William T. Freeman. 2022. “Unsupervised Semantic Segmentation by Distilling Feature Correspondences.” Proceedings of the International Conference on Learning Representations.
Google Scholar
Hou, Mengjing, Jianpeng Yin, Jing Ge, Yuanchuan Li, Qisheng Feng, and Tiangang Liang. 2020. “Land Cover Remote Sensing Classification Method of Alpine Wetland Region Based on Random Forest Algorithm.” Transactions Chin. Soc. Agric. Mach 51 (7): 220–227. https://doi.org/10.6041/j.issn.1000-1298.2020.07.025.
Google Scholar
Huang, Xin., Y. X. Cao, and J. Y. Li. 2020. “An Automatic Change Detection Method for Monitoring Newly Constructed Building Areas Using Time-Series Multi-View High-Resolution Optical Satellite Images.” Remote Sensing of Environment 244: 111802. https://doi.org/10.1016/j.rse.2020.111802.
Web of Science ®Google Scholar
Jia, Xiaohong, Tao Lei, Peng Liu, Dinghua Xue, Hongying Meng, and Asoke K. Nandi. 2020. “Fast and Automatic Image Segmentation Using Superpixel-Based Graph Clustering.” IEEE Access 8: 211526–211539. https://doi.org/10.1109/ACCESS.2020.3039742.
Web of Science ®Google Scholar
Jiang, Feng, Qign Gu, Huizhen Hao, Na Li, YanWen Guo, and Daoxu Chen. 2016. “Survey on Content-Based Image Segmentation Methods.” Journal of Software 28 (1): 160–183. https://doi.org/10.13328/j.cnki.jos.005136.
Google Scholar
Jiang, Jie, Chengjin Lyu, Siying Liu, Yongqiang He, and Xuetao Hao. 2020. “RWSNet: A Semantic Segmentation Network Based on SegNet Combined with Random Walk for Remote Sensing.” International Journal of Remote Sensing 41 (2): 487–505. https://doi.org/10.1080/01431161.2019.1643937.
Web of Science ®Google Scholar
Kotaridis, Ioannis., and Maria Lazaridou. 2021. “Remote Sensing Image Segmentation Advances: A Meta-Analysis.” ISPRS Journal of Photogrammetry and Remote Sensing 173: 309–322. https://doi.org/10.1016/j.isprsjprs.2021.01.020.
Web of Science ®Google Scholar
Kwak, S., S. Hong, and B. Han. 2017. “Weakly Supervised Semantic Segmentation Using Superpixel Pooling Network.” Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI’17), 4111–4117.
Google Scholar
Lei, Tao, Yuntong Li, Wenzhen Zhou, Qibin Yuan, Chengbin Wang, and Xiaohong Zhang. 2022. “Grain Segmentation of Ceramic Materials Using Data-driven Jointing Model-driven.” Acta automatica sinica 48 (4): 1137–1152. https://doi.org/10.16383/j.aas.c200277.
Google Scholar
Lei, Tao, Peng Liu, Xiaohong Jia, Xuande Zhang, Hongying Meng, and Asoke K. Nandi. 2020. “Automatic Fuzzy Clustering Framework for Image Segmentation.” IEEE Transactions on Fuzzy Systems 28 (9): 2078–2092. https://doi.org/10.1109/TFUZZ.2019.2930030.
Web of Science ®Google Scholar
Li, J. Y., X. Huang, and J. Y. Gong. 2019. “Deep Neural Network for Remote-Sensing Image Interpretation: Status and Perspectives.” National Science Review 6 (6): 1082–1086. https://doi.org/10.1093/nsr/nwz058.
PubMed Web of Science ®Google Scholar
Li, Haifeng, Kaijian Qiu, Li Chen, Xiaoming Mei, Liang Hong, and Chao Tao. 2021. “SCAttNet: Semantic Segmentation Network with Spatial and Channel Attention Mechanism for High-Resolution Remote Sensing Images.” IEEE Geoscience and Remote Sensing Letters 18 (5): 905–909. https://doi.org/10.1109/LGRS.2020.2988294.
Web of Science ®Google Scholar
Liu, Chun, Xin Huang, Zhe Zhu, Huijun Chen, Xinming Tang, and Jianya Gong. 2019a. “Automatic Extraction of Built-up Area from ZY3 Multi-View Satellite Imagery: Analysis of 45 Global Cities.” Remote Sensing of Environment 226 (June): 51–73. https://doi.org/10.1016/j.rse.2019.03.033.
Google Scholar
Liu, Han, Jun Li, Lin He, and Yu Wang. 2019b. “Superpixel-Guided Layer-Wise Embedding CNN for Remote Sensing Image Classification.” Remote Sensing 11 (2): 174. https://doi.org/10.3390/rs11020174.
Web of Science ®Google Scholar
Liu, W., A. Rabinovich, and A. C. Berg. 2015. “ParseNet: Looking Wider to See Better.” Proceedings of the International Conference on Learning Representations.
Google Scholar
Ma, Bifang., and Chih-Yung Chang. 2022. “Semantic Segmentation of High-Resolution Remote Sensing Images Using Multiscale Skip Connection Network.” IEEE Sensors Journal 22 (4): 3745–3755. https://doi.org/10.1109/JSEN.2021.3139629.
Web of Science ®Google Scholar
Pedregosa, Fabian, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, and Bertrand Thirion. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12 (2011): 2825–2830. https://doi.org/10.48550/arXiv.1201.0490.
Google Scholar
Ren, Xiaofeng., and Jitendra Malik. 2003. “Learning a Classification Model for Segmentation.” Proceedings Ninth IEEE International Conference on Computer Vision 1:10–17.
Google Scholar
Ren, Z. L., Q. P. Zhai, and L. Sun. 2023. “Spectral Clustering Eigenvector Selection of Hyperspectral Image Based on the Coincidence Degree of Data Distribution.” International Journal of Digital Earth 16 (1): 3489–3512. https://doi.org/10.1080/17538947.2023.2251436.
Web of Science ®Google Scholar
Shao, Zhenfeng, Yueming Sun, Jiangbo Xi, and Yan Li. 2022. “Intelligent Optimization Learning for Semantic Segmentation of High Spatial Resolution Remote Sensing lmages.” Geomatics and Information Science of Wuhan University 47 (2): 234–241. https://doi.org/10.13203/j.whugis20200640.
Google Scholar
Shelhamer, E., J. Long, and T. Darrell. 2017. “Fully Convolutional Networks for Semantic Segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4): 640–651. https://doi.org/10.1109/TPAMI.2016.2572683.
PubMed Web of Science ®Google Scholar
Su, Jingsuan, Liangxin Fan, Zhanliang Yuan, Zhen Wang, and Zhijun Wang. 2023. “Quantifying the drought sensitivity of grassland under different climate zones in Northwest China.” Science of the Total Environment 910: 168688. https://doi.org/10.1016/j.scitotenv.2023.168688.
PubMed Web of Science ®Google Scholar
Sun, Weiwei., and Ruisheng Wang. 2018. “Fully Convolutional Networks for Semantic Segmentation of Very High Resolution Remotely Sensed Images Combined With DSM.” IEEE Geoscience and Remote Sensing Letters 15 (3): 474–478. https://doi.org/10.1109/LGRS.2018.2795531.
Web of Science ®Google Scholar
Wang, Yiqin. 2021a. “Remote Sensing Image Semantic Segmentation Algorithm Based on Improved ENet Network.” Scientific Programming, Hindawi: e5078731. https://doi.org/10.1155/2021/5078731.
Web of Science ®Google Scholar
Wang, Zhen, Jianxin Guo, Wenzhun Huang, and Shanwen Zhang. 2021b. “High-Resolution Remote Sensing Image Semantic Segmentation Based on a Deep Feature Aggregation Network.” Measurement Science and Technology 32 (9): IOP Publishing: 095002. https://doi.org/10.1088/1361-6501/abfbfd.
Web of Science ®Google Scholar
Wang, Zhen, Zhaoqing Li, Rong Wang, Feiping Nie, and Xuelong Li. 2021c. “Large Graph Clustering With Simultaneous Spectral Embedding and Discretization.” IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (12): 4426–4440. https://doi.org/10.1109/TPAMI.2020.3002587.
PubMed Web of Science ®Google Scholar
Wang, Zhipan, Zhongwu Wang, Dongmei Yan, Zewen Mo, and Hua Zhang. 2023. “RepDDNet: A Fast and Accurate Deforestation Detection Model with High-Resolution Remote Sensing Image.” International Journal of Digital Earth 16 (1): 2013–2033. https://doi.org/10.1080/17538947.2023.2220619.
Web of Science ®Google Scholar
Wang, Zhimin, Jiasheng Wang, Kun Yang, Limeng Wang, and Fanjie Su. 2022. “Semantic Segmentation of High-Resolution Remote Sensing Images Based on a Class Feature Attention Mechanism Fused with Deeplabv3+.” Computers & Geosciences 158 (January): 104969. https://doi.org/10.1016/j.cageo.2021.104969.
Google Scholar
Wei, Pengliang, Ran Huang, Tao Lin, and Jingfeng Huang. 2022. “Rice Mapping in Training Sample Shortage Regions Using a Deep Semantic Segmentation Model Trained on Pseudo-Labels.” Remote Sensing 14 (2): 328. https://doi.org/10.3390/rs14020328.
Web of Science ®Google Scholar
Xu, Zhiyong, Weicun Zhang, Tianxiang Zhang, and Jiangyun Li. 2021. “HRCNet: High-Resolution Context Extraction Network for Semantic Segmentation of Remote Sensing Images.” Remote Sensing 13 (1), https://doi.org/10.3390/rs13010071.
PubMed Web of Science ®Google Scholar
Yang, Zenan, Haipeng Niu, Liang Huang, Xiaoxuan Wang, and Liangxin Fan. 2022. “Automatic Segmentation Algorithm for High-Spatial-Resolution Remote Sensing Images Based on Self-Learning Super-Pixel Convolutional Network.” International Journal of Digital Earth 15 (1): 1101–1124. https://doi.org/10.1080/17538947.2022.2083247.
Web of Science ®Google Scholar
Yang, Fengting, Qian Sun, Hailin Jin, and Zihan Zhou. 2020. “Superpixel Segmentation With Fully Convolutional Networks.” The Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13964–13973.
Google Scholar
Yu, Fisher., and Vladlen Koltun. 2016. “Multi-Scale Context Aggregation by Dilated Convolutions.” Proceedings of the International Conference on Learning Representations(ICLR).
Google Scholar
Yuan, X. H., J. F. Shi, and L. C. Gu. 2021. “A Review of Deep Learning Methods for Semantic Segmentation of Remote Sensing Imagery.” Expert Systems with Applications 169: 114417. https://doi.org/10.1016/j.eswa.2020.114417.
Web of Science ®Google Scholar
Zhao, Wei, Yi Fu, Xiaosong Wei, and Hai Wang. 2018. “An Improved Image Semantic Segmentation Method Based on Superpixels and Conditional Random Fields.” Applied Sciences 8 (5): 837. https://doi.org/10.3390/app8050837.
Google Scholar
Zhao, Danpei, Bo Yuan, Yue Gao, Xinhu Qi, and Zhenwei Shi. 2022. “UGCNet: An Unsupervised Semantic Segmentation Network Embedded With Geometry Consistency for Remote-Sensing Images.” IEEE Geoscience and Remote Sensing Letters 19: 1–5. https://doi.org/10.1109/LGRS.2021.3129776.
Web of Science ®Google Scholar
Zhao, Yang, Yuan Yuan, Feiping Nie, and Qi Wang. 2018. “Spectral Clustering Based on Iterative Optimization for Large-Scale and High-Dimensional Data.” Neurocomputing 318: 227–235. https://doi.org/10.1016/j.neucom.2018.08.059.
Web of Science ®Google Scholar

An unsupervised semantic segmentation method that combines the ImSE-Net model with SLICm superpixel optimization

ABSTRACT

1. Introduction

2. Methods

2.1. Implicit superpixel embedding network (ImSE-Net)

2.1.1. Preparatory

2.1.2. Implicit clustering module

Table 1. Detailed process for the ImSE-Net.

2.2. SLICm and superpixel optimization

Table 2. Detailed process for superpixel refinement.

2.3. Unsupervised semantic segmentation model UGLS

2.3.1. Global hidden pseudo-positives GHPsP

2.3.2. Local hidden pseudo-positives LHPsP

2.3.3. Loss function

Table 3. Detailed process for unsupervised semantic segmentation UGLS.

3. Experiments and discussion

3.1. Datasets

Table 4. Grouping information about datasets.

Table 5. Test set task categories.

3.2. Evaluation accuracy

Table 6. Quantitative evaluation indicators.

3.3. Analysis of results

Table 7. Description of the unsupervised semantic segmentation algorithms.

3.3.1. Segmentation test results

Table 8. Per-class result on the COCO-stuff test set.

Table 9. Evaluation scores (%) of different baseline methods based on the COCO-stuff test set.

Table 10. Comparison between baseline and state-of-the-art methods for experiment 2. The best values are highlighted in bold.

Table 11. Per-class result for experiment 3.

Table 12. Comparison between baseline and state-of-the-art methods for experiment 3.

3.3.1.1. Test results for experiment 1

3.3.1.2. Test results for experiment 2

3.3.1.3. Test results for experiment 3

3.3.1.4. Ablation study

Table 13. Experimental results with various backbone network combinations.

Table 14. Evaluation results of the hierarchical levels.

Table 15. Influence of main components on semantic segmentation results.

Table 16. Influence of different stages on semantic segmentation results.

4. Conclusion

Author contributions

Disclosure statement

Data availability statement

Additional information

Funding

References

Related research

To cite this article:

Download citation

Your download is now in progress and you may close this window

Login or register to access this feature

Information for

Open access

Opportunities

Help and information

Keep up to date