Research Article

Let the loss impartial: a hierarchical unbiased loss for small object segmentation in high-resolution remote sensing images

Article: 2254473 | Received 23 Dec 2022, Accepted 29 Aug 2023, Published online: 05 Sep 2023

ABSTRACT

The progress in optical remote sensing technology presents both possibilities and challenges for the small object segmentation task. However, the gap between human visual cognition and machine behavior still imposes an inherent constraint on the interpretation of small but key objects in large-scale remote sensing scenes. This paper summarizes this gap as a bias of the machine against the small object segmentation task, called scale-induced bias. The scale-induced bias degrades the performance of conventional remote sensing image segmentation methods. Therefore, this paper applies a straightforward but innovative insight to mitigate the scale-induced bias. Specifically, we propose a universal impartial loss, which uses a hierarchical approach to alleviate two sub-problems separately. A pixel-level statistical methodology is applied to remove the bias between the background and small objects, and an emendation vector is introduced to alleviate the bias between small object categories. Extensive experiments show that our method is fully compatible with existing segmentation structures; armed with the hierarchical unbiased loss, these structures achieve satisfactory improvement. The proposed method is validated on two benchmark remote sensing image datasets, where it achieves competitive performance and narrows the gap between human visual cognition and machine behavior.

Introduction

With the advancement of optical remote sensing data capture technology, numerous high-resolution remote sensing images (HRIs) are now being obtained from both satellite and airborne platforms (Mi & Chen, Citation2020; Tao et al., Citation2022). Higher resolution also means that humans and machines can perceive detailed features in large-scale HRIs, which presents both a challenge and an opportunity for small object segmentation (SOS). SOS is a widely followed task in the remote sensing interpretation community, and it aims to automatically extract key but small objects in large-scale remote sensing scenes. In recent years, extensive efforts (He et al., Citation2016; Long et al., Citation2015; Ronneberger et al., Citation2015) have been made to explore HRI segmentation and have achieved breakthrough improvements. However, new challenges arise when the objects to be recognized and segmented are small.

Accurately segmenting small objects is an unusual task because of the obvious gap between machines and humans in perceiving small objects in large-scale remote sensing scenes. For example, when humans intentionally search for key objects in large-scale remote sensing scenes, such as aircraft at an airport or ships in a port, the human visual perception system selectively ignores background information and focuses on these small but key objects. However, machines lack some of these inductive constraints, which leads directly to the inherent gap between human cognition and machine behavior in solving the SOS task. This inherent gap makes the SOS task inevitably suffer from performance bottlenecks. Specifically, as can be observed in large-scale remote sensing scenes, the background usually includes far more pixels than small objects (Li et al., Citation2021; Ma et al., Citation2022; Segl & Kaufmann, Citation2001). As presented in Figure 1, we believe that the scale-induced bias imposes two levels of bias on the SOS task: the first is the bias between the background and the small object category, shown in Figure 1(a), and the second is the bias between the small object categories, shown in Figure 1(b).

Figure 1. The explanation of the scale-induced bias. (a) the bias between the background and the small object category. (b) the bias between the small object categories, and the value is the proportion of each category in the small objects.


When deep learning-based segmentation networks are combined with the conventional training approach to optimize the overall loss of the model, the optimizer easily drives down the errors of pixels belonging to the background category (Guo et al., Citation2019; Li, Huang et al., Citation2021; Rabbi et al., Citation2020). However, it is more crucial to minimize the errors of pixels belonging to the small object categories, and the segmentation result for small objects is usually unsatisfactory because small objects contribute little to the overall segmentation loss. Moreover, we notice that small objects are sensitive to their surroundings (Guo et al., Citation2019), and the scale-induced bias can lead to severe performance degradation.

An ordinary strategy for improving the segmentation of small objects is to enhance the feature representation capability of the model. Fueled by the success of deep learning, many groups (Geiss et al., Citation2022; Kemker et al., Citation2018; Zhang, Wang et al., Citation2022) have dedicated themselves to designing robust and effective networks, and these attempts have yielded satisfactory gains. However, these methods depend heavily on hardware resources and lack generalization ability, and stacked network structures are somewhat weak at narrowing the gap between human perception and machine behavior. Notably, an ideal way to alleviate the scale-induced bias is to leverage the available data and models to improve the training output. We argue that optimizing the training process itself, i.e. adjusting the weights during loss convergence so that the output of the small object categories is emphasized, is also a feasible and effective solution.

Inspired by the above discussion and analysis, this paper proposes an intuitive and empirically motivated insight to mitigate the scale-induced bias. Specifically, we introduce a hierarchical unbiased loss function (HU-Loss) to tune the weights of each category during training. In HU-Loss, we apply the principal component analysis (PCA) method (Huang et al., Citation2022; Sabzi et al., Citation2013) to remove the bias between the background and small object categories and thus obtain the initial weight of each category in the loss function. The PCA method can find feature representations in the data that are easier to interpret and accelerate the processing of valuable information. These features can highlight small object representations in large-scale remote sensing image data, and thus effectively alleviate the bias between the background and small object categories. Considering the bias between the small object categories, we further propose an emendation vector to alleviate the imbalance between each small object category. To sum up, our primary contributions can be summarized as follows:

  1. We rethink the constraints and specialness of the SOS task. The inherent gap between human visual cognition and machine behavior still poses challenges for the SOS task. Beyond designing stacked network structures, optimizing the training process is also an effective way to improve segmentation accuracy, which sheds light on future work.

  2. We tune the training process by optimizing the loss function. The proposed unbiased loss function utilizes a hierarchical approach to alleviate scale-induced bias, which can prevent the performance of the SOS task from collapsing.

  3. We validate the proposed hierarchical unbiased loss function on several high-profile deep learning-based segmentation networks. Compared to their baselines, these structures achieve encouraging improvements when combined with the proposed loss function.

Related Work

Semantic segmentation

Semantic segmentation is a foundational task in the computer vision community, devoted to labelling each pixel within an image according to a set of pre-given categories (Chen et al., Citation2022; Yuan et al., Citation2013). Conventional statistical pattern recognition approaches (Tuia et al., Citation2012; Yi et al., Citation2012; Zhang et al., Citation2014) are able to utilize low-level features to generate semantic representations. However, these methods are sensitive to hand-crafted characteristics. In recent years, tremendous advancement has been made thanks to the convolutional neural network (CNN). The fully convolutional network (FCN) adapts to the dense prediction task by replacing the fully connected layers with convolutional ones, and the FCN and its extensions (Long et al., Citation2015; Schuegraf & Bittner, Citation2019; Tian et al., Citation2021) represent a milestone breakthrough in this field. Various network designs have introduced refined techniques, for instance skip connections (Li, Zheng, Duan et al., Citation2022; Ronneberger et al., Citation2015), atrous convolution (Chen et al., Citation2017, Citation2018; Zhao et al., Citation2017), attention mechanisms (He et al., Citation2022; Vaswani et al., Citation2017; Woo et al., Citation2018), and encoder-decoder structures (Chen et al., Citation2021; Long et al., Citation2015; Wang et al., Citation2021). In this paper, we also consider the generalization ability of the proposed method with respect to some basic semantic segmentation networks. In recent years, the Transformer has demonstrated great potential in global information modelling (Song et al., Citation2023; Wang et al., Citation2023). In the SOS task, however, small objects are insensitive to global features due to their tiny size and the weak correlation between small objects. Therefore, local features have a greater influence on small objects.

Small object segmentation

The development of satellite and sensor imaging technology makes it possible to capture high-resolution remote sensing images. These images contain clearer and more accurate information about ground objects, which makes the SOS task feasible in large-scale remote sensing scenes (Chong et al., Citation2022; Neupane et al., Citation2021). Compared with the conventional segmentation task, the SOS task is more prone to object identification problems. Many attempts in this field directly utilize or modify available CNN structures without taking into account the intricacy of remote sensing data and the SOS task. This restricted view limits the performance of the SOS task to a certain extent. To alleviate this issue, extensive efforts have been made to design novel network architectures. For example, Chong et al. (Citation2022) observed that small objects have a greater chance of being completely obscured by masks; in order to distinguish these small objects and refine their boundaries, a context union edge network was proposed and harvested state-of-the-art results. Ma et al. (Citation2022) noted that the key objects in large-scale HRIs often contain only a few pixels and designed a dual-branch decoder, which achieved satisfactory improvement in the SOS task. Rather than modifying an advanced network structure or designing a new one, in this article we propose a hierarchical unbiased loss with generalization ability to alleviate the imbalanced sample problem in large-scale remote sensing scenes.

Methodology

Existing advanced network designs over-rely on hardware resources and lack generalization capability, which makes them weak at resolving the scale-induced bias. The proposed HU-Loss effectively mitigates the scale-induced bias and provides better generalization capability. In this section, we introduce the hierarchical idea and the newly adapted loss function in Section 3.1 and Section 3.2, respectively.

Hierarchical solution

As shown in Figure 1, the scale-induced bias consists of two layers, which presents new challenges for a solution. Solving multi-level biases simultaneously faces considerable challenges and restrictions: the same operation is not necessarily applicable to different sample scales, and each of the two biases concerns a problem of a different scale.


Figure 2. Workflow of calculating the hierarchical unbiased weight. (a) the first layer, which is used to eliminate the bias between the background and the small object category. This layer yields the initial weights w′s and w′b. (b) the second layer, which is used to eliminate the bias between the small object categories. In this layer, the emendation vector Ψ is introduced to obtain the final weights, w1, …, wn, for each category.

Fortunately, the hierarchical solution will address exactly two layers of bias. The hierarchical idea decomposes the intricate work and then assigns subparts to each specialized unit, which allows layers to be independent of each other, and changes in one layer do not affect the other (An et al., Citation2020; Bongiorno et al., Citation2022; Kuang et al., Citation2021). Specifically, we divide the whole system into two subsystems, which are used to remove biases at different levels. One subsystem is dedicated to the elimination of the bias between the background and the small object category, and the bias between the small object categories will be tackled by another one. In Section 3.2, we will detail the proposed loss function based on the above hierarchical solution.

HU-loss function

In machine learning, the cross-entropy (CE) measures the difference between the predicted probability distribution generated during training and the true probability distribution (Abdollahi et al., Citation2021; Abid et al., Citation2021). However, the SOS task raises new challenges for the conventional CE loss function. The traditional CE loss treats every pixel, and hence every category, equally within a large-scale remote sensing scene, which may lead the optimizer to concentrate on driving down the errors of pixels belonging to the background category.

The proposed HU-Loss function assigns a weight to each category, which alleviates the scale-induced bias. The proposed Losshu can be derived as

(1) $\mathrm{Loss}_{hu}(p, y, w) = -\sum_{c=1}^{n} w_c \, y_c \log(p_c)$

where $1 \le c \le n$, with n the total number of categories; $p_c \in [0,1]$ is the network's estimated probability for category c; $y_c \in \{0,1\}$ is the indicator function that designates the ground-truth class (i.e. $y_c = 1$ for the true category); and $w_c$ represents the weight of category c.
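For concreteness, the weighted form in Equation (1) maps directly onto the weighted cross-entropy available in standard frameworks. The following is a minimal PyTorch sketch, assuming the per-category weights wc have already been computed by the hierarchical procedure of Section 3.2; it is an illustration rather than the authors' released implementation.

```python
import torch.nn.functional as F

def hu_loss(logits, target, class_weights):
    """Per-category weighted cross-entropy following Equation (1).

    logits        : (B, n, H, W) raw network outputs.
    target        : (B, H, W) ground-truth category indices in [0, n-1].
    class_weights : (n,) tensor holding the hierarchical unbiased weights w_c.
    """
    # F.cross_entropy applies log-softmax internally and multiplies each
    # pixel's negative log-likelihood by the weight of its true category.
    return F.cross_entropy(logits, target, weight=class_weights)
```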

Figure 3(a) presents the optimization process using CE Loss, and Figure 3(b) presents the process using HU-Loss. Although both achieve convergence, (a) ignores the small but key object A. Clearly, the convergence of the loss in (a) is mainly due to the large white background regions being identified, whereas the key object A contributes little to the overall loss. In fact, the process in (a) hardly captures small objects, which poses a great challenge for the SOS task. In contrast, after introducing HU-Loss in (b), the misjudgment of the small object A is directly reflected by the elevated loss value, which enables the optimizer to improve its output. In the 3rd, 4th and 5th steps of the optimization process, overlooking small object A causes fluctuations in the loss regression process, which allows the optimizer to adjust accordingly in the next stage. In this way, armed with the proposed HU-Loss, these models pay more attention to the small objects.

Figure 3. Sketches of the two loss function optimization processes. (a) CE Loss, and (b) HU-Loss. The pink and gray objects in the figure are small objects that are identified.


An overview of the workflow for calculating the hierarchical unbiased weight wc is shown in Figure 2. First, we compute pixel-level statistics M for the training samples, which serve as the input of the first layer. The first layer is used to eliminate the bias between the background and the small object category. In this layer, the PCA method is employed to analyze M and obtain the initial weights w′s and w′b. In our model, the different categories are treated as different feature dimensions, and the PCA-based method analyzes and defines the weight value of each feature dimension. w′s, w′b and M then serve as the input of the second layer.

The output of the first layer can be defined as a set W′ = {w′1, … , w′n}, where the nth category is the background. We determine w′s and w′b according to

(2) $w'_s = \sum_{c} w'_c, \quad 1 \le c \le n-1$
(3) $w'_b = w'_n$
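The paper does not spell out exactly how the PCA analysis of M is converted into the per-category weights W′, so the sketch below is only one plausible reading: each category is treated as a feature dimension of an assumed per-image pixel-count matrix M, and the absolute loadings of the leading principal component, normalized to sum to one, serve as the first-layer weights, which are then aggregated by Equations (2) and (3).

```python
import numpy as np
from sklearn.decomposition import PCA

def first_layer_weights(M):
    """First layer of the hierarchical weighting (Equations (2)-(3)).

    M : (num_images, n) array of per-image pixel counts, one column per
        category, with the background stored in the last column.

    Returns W' = (w'_1, ..., w'_n) together with w'_s and w'_b.
    """
    # Treat each category as a feature dimension; standardize so that the
    # huge background counts do not dominate the decomposition outright.
    X = (M - M.mean(axis=0)) / (M.std(axis=0) + 1e-8)

    # Illustrative choice (an assumption, not necessarily the authors'
    # exact procedure): the absolute loadings of the leading principal
    # component, normalized to sum to one, act as first-layer weights.
    pca = PCA(n_components=1).fit(X)
    loadings = np.abs(pca.components_[0])
    w_prime = loadings / loadings.sum()

    w_s = w_prime[:-1].sum()   # Equation (2): sum over small-object weights
    w_b = w_prime[-1]          # Equation (3): background weight
    return w_prime, w_s, w_b
```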

In the second layer, we propose an emendation vector Ψ to remove the bias between the small object categories and obtain the final weight wc. The statistical result of M in the second layer can be expressed as

(4) $P = \{\, p_c \mid p_c = \lambda(M), \ 1 \le c \le n-1 \,\}$

where λ(·) represents the statistical process, and pc is the proportion of category c among all small object categories. One of the primary operations is to define Ψ, and we derive ψc as

(5) $\psi_c = \dfrac{\theta(P) - p_c}{(n-2)\,\theta(P)}$

where ψc is the emendation component of Ψ for each small object category, and θ(·) denotes the sum of the elements in the set.

The output of the second layer can be defined as a set W = {w1, … , wn}, and we determine wc according to

(6) $w_c = \psi_c \cdot w'_s, \quad 1 \le c \le n-1$
(7) $w_c = w'_b, \quad c = n$

The final weights, w1, … , wn, can be combined with Equation (1) to determine the HU-Loss function. Higher entropy indicates higher uncertainty in the output of the network, meaning that the network has not learned well. In contrast, lower entropy indicates lower uncertainty of the network output and a more accurate estimation. After assigning weights to each category, the training process pays more attention to small objects while reducing the loss.
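A companion sketch for the second layer, under the same assumed layout of the statistics M: the proportions P are computed from the small-object pixel counts, the emendation components follow Equation (5), and the final weights follow Equations (6) and (7).

```python
import numpy as np

def second_layer_weights(M, w_s, w_b):
    """Second layer of the hierarchical weighting (Equations (4)-(7)).

    M   : (num_images, n) per-image pixel counts, background in the last column.
    w_s : small-object weight from the first layer (Equation (2)).
    w_b : background weight from the first layer (Equation (3)).
    """
    n = M.shape[1]

    # Equation (4): proportion p_c of each small-object category among
    # all small-object pixels (the statistical process lambda).
    small_counts = M[:, :-1].sum(axis=0)
    P = small_counts / small_counts.sum()

    # Equation (5): rarer categories receive a larger emendation component;
    # the psi_c sum to one because the proportions in P sum to one.
    theta_P = P.sum()
    psi = (theta_P - P) / ((n - 2) * theta_P)

    # Equations (6)-(7): final per-category weights w_1, ..., w_n.
    w = np.empty(n)
    w[:-1] = psi * w_s   # small-object categories
    w[-1] = w_b          # background
    return w
```

The resulting vector can then be handed to the weighted loss above, e.g. as torch.tensor(w, dtype=torch.float32).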

Experimental results and analysis

Dataset and evaluation index

In order to assess the effectiveness of the proposed HU-Loss, the validation is conducted on two benchmark large-scale remote sensing scene datasets.

iSAID

This dataset was derived from a large-scale detection dataset and is dedicated to the small object semantic segmentation task. The iSAID dataset (Ma et al., Citation2022) provides pixel-level annotations for the images in the DOTA dataset (Xia et al., Citation2018) and corrects the labeling errors in DOTA. The dataset fully reflects the common features and scale distribution differences of remote sensing images. The image sizes range from 12,029 × 5014 to 455 × 387 pixels. Due to memory limitations, we divide the original images into 512 × 512 patches during training.
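The cropping itself is routine; a minimal sketch, assuming non-overlapping 512 × 512 tiles with zero padding at the right and bottom edges (the paper does not state its exact tiling or overlap strategy), is given below.

```python
import numpy as np

def tile_image(img, tile=512):
    """Split an (H, W, C) image into non-overlapping tile x tile patches,
    zero-padding the right and bottom edges when H or W is not a multiple
    of the tile size."""
    H, W = img.shape[:2]
    pad_h, pad_w = (-H) % tile, (-W) % tile
    img = np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)))
    return [img[i:i + tile, j:j + tile]
            for i in range(0, img.shape[0], tile)
            for j in range(0, img.shape[1], tile)]
```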

ISPRS Vaihingen

The ISPRS Vaihingen dataset has 33 images, each with three bands corresponding to near-infrared, red, and green wavelengths, and the ground sampling distance is 9 cm (Zhao et al., Citation2021). This dataset was captured by an airborne camera and has previously been used for the HRI semantic segmentation task. After further processing, houses and vehicles were considered small objects, while vegetation and road surfaces were considered background.

Accuracy and efficiency are significant indices for the evaluation of the SOS task. Therefore, overall accuracy (OA), IoU and mIoU are selected as objective indexes (Li, Zheng, Zhang et al., Citation2022; Zhang, Jiang et al., Citation2022). OA represents the proportion of all pixels in a prediction map that match the corresponding category in the ground truth, and intersection over union (IoU) is the standard index of semantic segmentation. mIoU is the mean of the IoU over all categories, which can be calculated as

(8) $\mathrm{IoU}_c = \dfrac{x_{cc}}{\sum_{d} x_{cd} + \sum_{d} x_{dc} - x_{cc}}, \quad 1 \le c \le n$
(9) $\mathrm{mIoU} = \dfrac{1}{n}\sum_{c=1}^{n} \mathrm{IoU}_c$

where xcd is the number of pixels of category c predicted as category d, and n is the number of categories.

However, mIoU is not fully suitable for the SOS task, because the inherent bias also affects the evaluation metrics: the simple categories, i.e. the background, tend to mislead the overall evaluation, whereas we prefer to accurately evaluate the result of the SOS task. Motivated by this, this paper proposes a new evaluation metric, called small object unbiased intersection over union (sIoU), which is dedicated to assessing the result of SOS. In sIoU, we ignore the background and compute the metric only over the small object categories. However, since the bias within the small objects still exists, we introduce the emendation vector Ψ again, i.e. Equation (5). The sIoU is expressed as

(10) $\mathrm{sIoU} = \sum_{c} \psi_c \cdot \mathrm{IoU}_c, \quad 1 \le c \le n-1$
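All three indices can be computed from a single confusion matrix. The sketch below assumes the background is the last category and reuses the emendation vector Ψ of Equation (5) for the n−1 small-object categories.

```python
import numpy as np

def segmentation_metrics(conf, psi):
    """Per-category IoU, mIoU and sIoU from an n x n confusion matrix.

    conf[c, d] : number of pixels of category c predicted as category d
                 (background assumed to be the last category).
    psi        : emendation vector of Equation (5), length n - 1.
    """
    tp = np.diag(conf)
    # Equation (8): union = pixels labelled c + pixels predicted c - overlap.
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    iou = tp / np.maximum(union, 1)

    miou = iou.mean()                # Equation (9)
    siou = (psi * iou[:-1]).sum()    # Equation (10): background excluded
    return iou, miou, siou
```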

Comparison experiments

iSAID

In this section, we employ several popular segmentation networks (e.g. U-Net (Ronneberger et al., Citation2015), FCN8s (Long et al., Citation2015), PSPNet (Zhao et al., Citation2017), and Segmenter (Strudel et al., Citation2021)) to validate the effectiveness and generalization ability of the proposed HU-Loss function. Under the premise of the same network structure, this paper compares the experimental results obtained by adopting the conventional CE loss function and the proposed HU-Loss function in training, respectively. Furthermore, the selected networks are based on different backbones, which explicitly illustrates that the proposed HU-Loss has outstanding robustness and generalization ability. The segmentation involves six categories (i.e. background, small vehicle, large vehicle, plane, helicopter and ship). To facilitate the presentation, the following abbreviations are used: BG-Background, SV-Small Vehicle, LV-Large Vehicle, HL-Helicopter.
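In these paired runs, the only difference between a baseline and its HU-Loss counterpart is the training criterion; schematically, with a hypothetical weight vector standing in for the output of the procedure in Section 3.2:

```python
import torch
import torch.nn as nn

# Hypothetical weights for the six iSAID categories (SV, LV, plane, HL,
# ship, BG), standing in for the output of the hierarchical procedure.
hu_weights = torch.tensor([0.18, 0.16, 0.14, 0.22, 0.20, 0.10])

ce_criterion = nn.CrossEntropyLoss()                   # baseline
hu_criterion = nn.CrossEntropyLoss(weight=hu_weights)  # baseline w/ HU-Loss
```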

The results of the quantitative analysis of the different models are presented in Table 1. As shown there, directly employing existing networks for the small object segmentation task yields unsatisfactory results. The initial input of U-Net is a 3-channel image with four scale depths. ResNet and VGG backbones are selected for FCN8s and PSPNet, respectively; the ResNet backbone offers better accuracy and inference speed, while VGG has a simpler structure, and a combination of several small filters is superior to one large filter. U-Net achieves the worst result, which is 11.63% lower than the same network with HU-Loss. Although FCN8s performs fairly well, FCN8s with HU-Loss still achieves gains of 11.29% in mIoU. The gain of PSPNet in mIoU is slightly smaller than that of the other two networks, but an improvement of 4.02% in mIoU is still a satisfactory result. Notably, in terms of OA, the addition of HU-Loss brings improvement; however, for PSPNet, the baseline with HU-Loss does not achieve gains, as some categories are adversely affected.

Table 1. Performance comparison of the different models on iSAID. Baseline means the original network with CE Loss, and green (+XX) and red (−YY), respectively, represent the performance improvement or decrease caused by HU-Loss.

The per-category performance of the different models is listed in Table 2. U-Net performs poorly, especially in the helicopter category, due to its shallow layers. However, U-Net with HU-Loss achieves relatively good results; the gain of 29.11% in the helicopter category is an inspiring result. FCN8s with HU-Loss shows a fairly comprehensive improvement over FCN8s: it improves in all small object categories after combining with HU-Loss. PSPNet with HU-Loss also makes some progress, but it suffers some adverse effects in the large and small vehicle categories. The expansion of the receptive field makes the network fail to perceive some small objects, so these objects become difficult to distinguish in large-scale remote sensing scenes. This issue will be part of our future work. In the SOS task, small objects are insensitive to global features due to their tiny size and the weak correlation between them. Therefore, local features have a greater influence on small objects, which explains the unsatisfactory performance of Transformer-based models in this task.

Table 2. Performance comparison of the different models in per class on iSAID. Baseline means the original network with CE Loss, and green (+XX) and red (−YY), respectively, represent the performance improvement or decrease caused by HU-Loss.

To intuitively present the differences with and without HU-Loss, we select two typical results from an airport scene and a harbor scene, respectively. The results of the qualitative analysis are shown in Figures 4 and 5.

Figure 4. Visualization results on the iSAID airport scene. (a) original image, (b) ground truth, (c) U-Net, (d) U-Net w/HU-Loss, (e) FCN8s, (f) FCN8s w/HU-Loss, (g) PSPNet, (h) PSPNet w/HU-Loss (i) Segmenter and (j) Segmenter w/HU-Loss.


Figure 5. Visualization results on the iSAID harbor scene. (a) original image, (b) ground truth, (c) U-Net, (d) U-Net w/HU-Loss, (e) FCN8s, (f) FCN8s w/HU-Loss, (g) PSPNet, (h) PSPNet w/HU-Loss (i) Segmenter and (j) Segmenter w/HU-Loss.


As shown in Figure 4, the small objects are sparsely distributed. The plane category is relatively easy to recognize, and the segmentation networks with HU-Loss achieve better and more refined performance on it. However, the small and large vehicle categories are difficult to distinguish, which leads to severe performance degradation. Note that the small and large vehicles are very small in the large-scale remote sensing scene and are often overlooked by conventional methods. With the adoption of HU-Loss, the recognition of these small objects improves. Furthermore, we discuss a detailed challenge: helicopters are scarce in this scene. U-Net misjudges all helicopters. Fortunately, U-Net with HU-Loss identifies the helicopter within this scene, which explicitly shows that HU-Loss is a significant refinement of the conventional approach.

As shown in Figure 5, we present the segmentation results in the harbor scene. Unlike planes, ships are more diverse in appearance and their distribution is irregular, which raises new challenges for ship segmentation. FCN8s misclassifies the ships as large vehicles because ships and large vehicles have similar colors. FCN8s with HU-Loss accurately segments the ships from the large-scale remote sensing scene, and HU-Loss plays a key role in this. The ship category is scarce in the training samples, which reveals that the plentiful biased samples mislead the model optimization. However, U-Net with HU-Loss achieves relatively poor performance here, which will be investigated as part of our future work.

ISPRS Vaihingen

In the previous comparison experiments on the iSAID dataset, the conventional structures with HU-Loss are able to recognize tricky small objects, while the baseline networks are hardly able to do so. Further, owing to the different properties of each remote sensing dataset, such as resolution and band combinations, some baseline networks may already produce satisfactory output on a dataset, yet there is still room for improvement. Beyond substantially improving weak results, further refining outputs that already have a good foundation is also an important manifestation of the generalization ability of a method. Therefore, we further conduct experiments on the ISPRS Vaihingen dataset because of its popularity, categories and band combinations.

Tables 3 and 4 present the results of the quantitative analysis on the Vaihingen dataset. From an overall perspective, the structures used for the comparison experiments all exhibit different degrees of improvement after combining with HU-Loss. U-Net and FCN8s improve in all indexes after combining with HU-Loss, where FCN8s gains 2.26% in sIoU and U-Net gains 8.81% in sIoU, which is a considerable experimental outcome. After combining PSPNet with HU-Loss, there is a satisfactory improvement in both mIoU and sIoU. However, it fluctuates slightly in OA, which is still due to the effect of atrous convolution on small objects. The performance of Segmenter is slightly lower than that of the other methods. As mentioned in Section 4.2.1, global features and local features have different impacts on the SOS task. Segmenter with HU-Loss still makes considerable improvements in the car category owing to the cars' tiny size. How to overcome the effect of atrous convolution and Transformer-based methods on small objects is a problem that we will continue to study in the future.

Table 3. Performance comparison of the different models on ISPRS Vaihingen. Baseline means the original network with CE Loss, and green (+XX) and red (−YY), respectively, represent the performance improvement or decrease caused by HU-Loss.

Table 4. Performance comparison of the different models in per class on ISPRS Vaihingen. Baseline means the original network with CE Loss, and green (+XX) and red (−YY), respectively, represent the performance improvement or decrease caused by HU-Loss.

Specifically, as shown in Table 4, almost every category is satisfactorily enhanced. In particular, the vehicle category gains the most significant enhancement, because the proposed method allows the model to give the vehicles more attention during training. U-Net with HU-Loss gains 9.17% in the vehicle category, which is an uplifting improvement. However, after PSPNet is combined with HU-Loss, the background category performs poorly, which will be part of our future work.

As presented in Figure 6, we show the qualitative results on ISPRS Vaihingen. Overall, the individual models are able to identify most of the houses and vehicles. There are many cases of misclassified houses for FCN8s and PSPNet. In (g) and (h), the baseline network combined with HU-Loss differentiates the houses more accurately, as marked by the yellow boxes. Meanwhile, as marked by the yellow boxes in (c), (d), (e) and (f), FCN8s and U-Net are more accurate in distinguishing tiny objects in large-scale remote sensing scenes. These tiny objects, i.e. vehicles, though small, are indeed the key objects for the SOS task. The qualitative analysis shows that the proposed method is effective.

Figure 6. Visualization results on the ISPRS Vaihingen. (a) original image, (b) ground truth, (c) U-Net, (d) U-Net w/HU-Loss, (e) FCN8s, (f) FCN8s w/HU-Loss, (g) PSPNet, (h) PSPNet w/HU-Loss, (i) Segmenter and (j) Segmenter w/HU-Loss.


In summary, for the original experimental output with a better foundation, the proposed HU-Loss is still able to provide continuous refinement, which is important for the further improvement of the accuracy of SOS task.

The analysis of small objects and background division effects

As established in the previous analysis, owing to the uniqueness of the SOS task, the key to optimizing its output is to make the model focus more on the small objects rather than the background, which effectively mimics the human visual sensitivity to small objects. To verify the contribution of HU-Loss to this imitation, we analyze the output features of FCN8s for the port scene as an example. Specifically, we select the small object channels in the output feature map for visualization; in the visualization result, brighter areas indicate a higher probability of belonging to the small objects, and darker areas indicate a lower probability. The visualization results are shown in Figure 7.
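The visualization described here can be reproduced approximately by summing the softmax probabilities over the small-object channels of the network output; a minimal sketch, assuming the background occupies the last channel, is shown below.

```python
import torch

def small_object_heatmap(logits):
    """Sum of softmax probabilities over the small-object channels.

    logits : (1, n, H, W) network output for a single image, with the
             background assumed to be the last channel. Brighter values in
             the returned (H, W) map indicate a higher estimated probability
             of belonging to any small-object category.
    """
    probs = torch.softmax(logits, dim=1)
    return probs[0, :-1].sum(dim=0)
```

Displaying the returned map with a grayscale colormap yields the bright-object / dark-background rendering discussed for Figure 7.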

Figure 7. The feature map visualization results of FCN8s for the port scene. (a) original image, (b) reference binary chart (black is the background, white is the small objects), (c) FCN8s, and (d) FCN8s w/HU-Loss.


It is obvious that the overall result is much improved after the introduction of HU-Loss. Although both the baseline (c) and the baseline with HU-Loss (d) are able to identify small objects, the baseline model is still prone to focusing on the background, which degrades its performance. In addition, in the top right of the scene, the baseline confuses the background with a ship, which leads directly to a misjudgment of these small objects. Furthermore, the baseline feature map visualization is brighter overall, which means that the baseline model does not separate the small objects from the background well. Unlike the general HRI segmentation task, SOS is more susceptible to background interference, which causes the segmentation of small objects to collapse. In contrast, the baseline combined with HU-Loss achieves better performance. In Figure 7(d), it can be seen that the small objects are highlighted and the background is effectively restricted, i.e. the background area is darker, which is the more ideal state for the SOS task, where the model pays more attention to small objects.

Overall, the proposed HU-Loss helps the original model identify small objects in large-scale remote sensing images more effectively, while also constraining the background. In this way, the original structures, equipped with HU-Loss, are better able to perform the SOS task.

The analysis of experimental process

In this section, we further discuss the detailed progress of the experiments, which verifies the effectiveness of the proposed HU-Loss. The iSAID dataset is used as an example for this analysis, and the conventional mIoU and the newly proposed sIoU remain the indicators of the analysis. We illustrate the regression process for small objects and the stability of each method combined with HU-Loss. The detailed experimental results are shown in Figure 8.

Figure 8. mIoU and sIoU of different networks with or without HU-Loss as the number of iterations increases. (a) U-Net, (b) FCN8s, and (c) PSPNet.


As shown in Figure 8, at almost every node the experimental output of the baseline with HU-Loss is better than that without it, which explicitly indicates that the proposed HU-Loss can be well incorporated into these deep learning-based architectures and plays a positive role. In addition, as the number of iterations increases, the experimental outputs improve, which means that HU-Loss does not make the original network degraded or unstable. It is noticeable that the fluctuation of PSPNet is somewhat large. This is because the atrous convolution in PSPNet ignores many small objects, so the network fits faster at first and becomes unstable in the subsequent process. Moreover, the proposed HU-Loss makes sIoU continuously approach mIoU, which is a good indication that HU-Loss narrows the gap between the background and small objects in the loss, i.e. HU-Loss makes the regression process pay more attention to small objects.

In conclusion, the proposed HU-Loss plays a positive and significant role when incorporated into conventional deep learning structures. It enables these networks to alleviate the bias against small objects during training and thus effectively improves the SOS task, which significantly narrows the gap between human visual perception and machine behavior.

Conclusion

In this paper, we proposed an unbiased loss function to alleviate the scale-induced bias in SOS task. The hierarchical solution was employed to mitigate the two sub-problems in the scale-induced bias. Moreover, we extended the proposed method to a more challenging generalized setting and produced uplifting improvements compared to the baseline.

In the future, we hope that our work may shed fresh light on small object interpretation and inform the design of loss constraints for weak object detection, weak information extraction, special feature recognition, etc. At the same time, how to balance the segmentation accuracy between the background and small objects will be addressed in the next step of our work. In addition, how to make Transformer-based models more suitable for the SOS task is also a problem that we will continue to investigate in the future.

Acknowledgments

All authors sincerely thank the reviewers and editors for their suggestions and opinions on improving this article.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

Thanks to the providers of the datasets. The data presented in this study are openly available at http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html and https://captain-whu.github.io/iSAID/index.html. The data used to support this work will be provided at https://github.com/CVFishwgy/HU-loss.

Additional information

Funding

This research was funded by the National Natural Science Foundation of China under [Grant 62072391] and [Grant 62066013] and the Graduate Science and Technology Innovation Fund Project of Yantai University under [Grant GGIFYTU2320].

References

  • Abdollahi, A., Pradhan, B., & Alamri, A. (2021). RoadVecNet: A new approach for simultaneous road network segmentation and vectorization from aerial and google earth imagery in a complex urban set-up. Giscience & Remote Sensing, 58(7), 1151–14. https://doi.org/10.1080/15481603.2021.1972713
  • Abid, N., Shahzad, M., Malik, M. I., Schwanecke, U., Ulges, A., Kovacs, G., & Shafait, F. (2021). UCL: Unsupervised curriculum learning for water body classification from remote sensing imagery. International Journal of Applied Earth Observation and Geoinformation, 105, 102568. https://doi.org/10.1016/j.jag.2021.102568
  • An, N., Domel, A. G., Zhou, J. X., Rafsanjani, A., & Bertoldi, K. (2020). Programmable hierarchical kirigami. Advanced Functional Materials, 30(6), 1906711. https://doi.org/10.1002/adfm.201906711
  • Bongiorno, C., Micciche, S., & Mantegna, R. N. (2022). Statistically validated hierarchical clustering: Nested partitions in hierarchical trees. Physica A-Statistical Mechanics and Its Applications, 593, 126933. https://doi.org/10.1016/j.physa.2022.126933
  • Chen, B. Y., Xia, M., Qian, M., & Huang, J. Q. (2022). Manet: A multi-level aggregation network for semantic segmentation of high-resolution remote sensing images. International Journal of Remote Sensing, 1–21. https://doi.org/10.1080/01431161.2022.2073795
  • Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV).
  • Chen, Z. Y., Wang, C., Li, J., Fan, W. T., Du, J. X., & Zhong, B. N. (2021). Adaboost-like end-to-end multiple lightweight U-nets for road extraction from optical remote sensing images. International Journal of Applied Earth Observation and Geoinformation, 100, 102341. https://doi.org/10.1016/j.jag.2021.102341
  • Chong, Q., Xu, J., Jia, F., Liu, Z., Yan, W., Wang, X., & Song, Y. (2022). A multiscale fuzzy dual-domain attention network for urban remote sensing image segmentation. International Journal of Remote Sensing, 43(14), 5480–5501. https://doi.org/10.1080/01431161.2022.2135413
  • Chong, Y. W., Chen, X. S., & Pan, S. M. (2022). Context union edge network for semantic segmentation of small-scale objects in very high-resolution remote sensing images. IEEE Geoscience and Remote Sensing Letters, 19, 1–5. https://doi.org/10.1109/LGRS.2020.3021210
  • Geiss, C., Zhu, Y., Qiu, C. P., Mou, L. C., Zhu, X. X., & Taubenbock, H. (2022). Deep relearning in the geospatial domain for semantic remote sensing image segmentation. IEEE Geoscience and Remote Sensing Letters, 19, 1–5. https://doi.org/10.1109/LGRS.2020.3031339
  • Guo, D. Z., Zhu, L. G., Lu, Y. H., Yu, H. K., & Wang, S. (2019). Small object sensitive segmentation of urban street scene with spatial adjacency between object classes. IEEE Transactions on Image Processing, 28(6), 2643–2653. https://doi.org/10.1109/TIP.2018.2888701
  • He, H. J., Gao, K. Y., Tan, W. K., Wang, L. Y., Chen, N., Ma, L. F., & Li, J. A. T. (2022). Super-resolving and composing building dataset using a momentum spatial-channel attention residual feature aggregation network. International Journal of Applied Earth Observation and Geoinformation, 111, 102826. https://doi.org/10.1016/j.jag.2022.102826
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Huang, D. S., Jiang, F. W., Li, K. P., Tong, G. S., & Zhou, G. F. (2022). Scaled PCA: A new approach to dimension reduction. Management Science, 68(3), 1678–1695. https://doi.org/10.1287/mnsc.2021.4020
  • Kemker, R., Luu, R., & Kanan, C. (2018). Low-shot learning for the semantic segmentation of remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 56(10), 6214–6223. https://doi.org/10.1109/TGRS.2018.2833808
  • Kuang, Z. Z., Zhang, X., Yu, J., Li, Z. M., & Fan, J. P. (2021). Deep embedding of concept ontology for hierarchical fashion recognition. Neurocomputing, 425, 191–206. https://doi.org/10.1016/j.neucom.2020.04.085
  • Li, R., Zheng, S. Y., Duan, C. X., Su, J. L., & Zhang, C. (2022). Multistage attention resU-Net for semantic segmentation of fine-resolution remote sensing images. IEEE Geoscience and Remote Sensing Letters, 19, 1–5. https://doi.org/10.1109/LGRS.2021.3063381
  • Li, R., Zheng, S. Y., Zhang, C., Duan, C. X., Su, J. L., Wang, L. B., & Atkinson, P. M. (2022). Multiattention network for semantic segmentation of fine-resolution remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–13. https://doi.org/10.1109/TGRS.2021.3093977
  • Li, X., He, H., Li, X., Li, D., Cheng, G., Shi, J., Weng, L., Tong, Y., & Lin, Z. (2021). PointFlow: Flowing semantics through points for aerial image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Li, Y. Y., Huang, Q., Pei, X., Chen, Y. Q., Jiao, L. C., & Shang, R. H. (2021). Cross-layer attention network for small object detection in remote sensing imagery. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 2148–2161. https://doi.org/10.1109/JSTARS.2020.3046482
  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Ma, A. L., Wang, J. J., Zhong, Y. F., & Zheng, Z. (2022). FactSeg: Foreground activation-driven small object semantic segmentation in large-scale remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–16. https://doi.org/10.1109/TGRS.2021.3097148
  • Mi, L., & Chen, Z. Z. (2020). Superpixel-enhanced deep neural forest for remote sensing image semantic segmentation. ISPRS Journal of Photogrammetry and Remote Sensing, 168, 153–155. https://doi.org/10.1016/j.isprsjprs.2020.08.015
  • Neupane, B., Horanont, T., & Aryal, J. (2021). Deep learning-based semantic segmentation of urban features in satellite images: A review and meta-analysis. Remote Sensing, 13(4), 808. https://doi.org/10.3390/rs13040808
  • Rabbi, J., Ray, N., Schubert, M., Chowdhury, S., & Chao, D. (2020). Small-object detection in remote sensing images with end-to-end edge-enhanced GAN and object detector network. Remote Sensing, 12(9), 1432. https://doi.org/10.3390/rs12091432
  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-assisted Intervention (MICCAI).
  • Sabzi, S., Javadikia, P., Rabani, H., & Adelkhani, A. (2013). Mass modeling of Bam orange with ANFIS and SPSS methods for using in machine vision. Measurement, 46(9), 3333–3341. https://doi.org/10.1016/j.measurement.2013.06.005
  • Schuegraf, P., & Bittner, K. (2019). Automatic building footprint extraction from multi-resolution remote sensing images using a hybrid FCN. ISPRS International Journal of Geo-Information, 8(4), 191. https://doi.org/10.3390/ijgi8040191
  • Segl, K., & Kaufmann, H. (2001). Detection of small objects from high-resolution panchromatic satellite imagery based on supervised image segmentation. IEEE Transactions on Geoscience and Remote Sensing, 39(9), 2080–2083. https://doi.org/10.1109/36.951105
  • Song, P., Li, J., An, Z., Fan, H., & Fan, L. (2023). Ctmfnet: CNN and Transformer multi-scale fusion network of remote sensing urban scene imagery. IEEE Transactions on Geoscience and Remote Sensing, 61, 1–14. https://doi.org/10.1109/TGRS.2022.3232143
  • Strudel, R., Ricardo, G., Ivan, L., & Cordelia, S. (2021). Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Tao, C. X., Meng, Y. Z., Li, J. J., Yang, B. B., Hu, F. M., Li, Y. X., Cui, C. L., & Zhang, W. (2022). Msnet: Multispectral semantic segmentation network for remote sensing images. Giscience & Remote Sensing, 59(1), 1177–1198. https://doi.org/10.1080/15481603.2022.2101728
  • Tian, T., Chu, Z. Q., Hu, Q., & Ma, L. (2021). Class-wise fully convolutional network for semantic segmentation of remote sensing images. Remote Sensing, 13(16), 3211. https://doi.org/10.3390/rs13163211
  • Tuia, D., Munoz-Mari, J., & Camps-Valls, G. (2012). Remote sensing image segmentation by active queries. Pattern Recognition, 45(6), 2180–2192. https://doi.org/10.1016/j.patcog.2011.12.012
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Proceedings of the Neural Information Processing Systems (NIPS).
  • Wang, L., Li, R., Zhang, C., Fang, S., Duan, C., Meng, X., & Atkinson, P. M. (2023). UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 190, 196–214. https://doi.org/10.1016/j.isprsjprs.2022.06.008
  • Wang, Y., Sun, Z. C., & Zhao, W. (2021). Encoder- and decoder-based networks using multiscale feature fusion and nonlocal block for remote sensing image semantic segmentation. IEEE Geoscience and Remote Sensing Letters, 18(7), 1159–1163. https://doi.org/10.1109/LGRS.2020.2998680
  • Woo, S., Park, J., Lee, J.-Y., & Kweon, I. S. (2018). CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV).
  • Xia, G.-S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., & Zhang, L. (2018). DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Yi, L., Zhang, G. F., & Wu, Z. C. (2012). A scale-synthesis method for high spatial resolution remote sensing image segmentation. IEEE Transactions on Geoscience and Remote Sensing, 50(10), 4062–4070. https://doi.org/10.1109/TGRS.2012.2187789
  • Yuan, J. Y., Gleason, S. S., & Cheriyadat, A. M. (2013). Systematic benchmarking of aerial image segmentation. IEEE Geoscience and Remote Sensing Letters, 10(6), 1527–1531. https://doi.org/10.1109/LGRS.2013.2261453
  • Zhang, C., Jiang, W. S., Zhang, Y., Wang, W., Zhao, Q., & Wang, C. J. (2022). Transformer and CNN hybrid deep neural network for semantic segmentation of very-high-resolution remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–20. https://doi.org/10.1109/TGRS.2022.3144894
  • Zhang, D. D., Wang, C. P., & Fu, Q. (2022). Near-shore ship segmentation based on I-Polar mask in remote sensing. International Journal of Remote Sensing, 43(9), 3470–3489. https://doi.org/10.1080/01431161.2022.2095878
  • Zhang, X. L., Xiao, P. F., Feng, X. Z., Wang, J. G., & Wang, Z. (2014). Hybrid region merging method for segmentation of high-resolution remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing, 98, 19–28. https://doi.org/10.1016/j.isprsjprs.2014.09.011
  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zhao, T. Y., Xu, J. D., Chen, R., & Ma, X. Y. (2021). Remote sensing image segmentation based on the fuzzy deep convolutional neural network. International Journal of Remote Sensing, 42(16), 6267–6286. https://doi.org/10.1080/01431161.2021.1938738