
Deep hierarchical transformer for change detection in high-resolution remote sensing images

Article: 2196641 | Received 30 Jun 2022, Accepted 24 Mar 2023, Published online: 06 Apr 2023

ABSTRACT

Deep learning instantiated by convolutional neural networks has achieved great success in high-resolution remote-sensing image change detection. However, such networks have a limited receptive field, being unable to extract long-range dependencies in a scene. As the transformer model with self-attention can better describe long-range dependencies, we introduce a hierarchical transformer model to improve the precision of change detection in high-resolution remote sensing images. First, the hierarchical transformer extracts abstract features from multitemporal remote sensing images. To effectively minimize the model’s complexity and enhance the feature representation, we limit the self-attention calculation of each transformer layer to local windows with different sizes. Then, we combine the features extracted by the hierarchical transformer and input them into a nested U-Net to obtain the change detection results. Furthermore, a simple but effective model fusion strategy is adopted to improve the change detection accuracy. Extensive experiments are carried out on two large-scale data sets for change detection, LEVIR-CD and SYSU-CD. The quantitative and qualitative experimental results suggest that the proposed method outperforms the advanced methods in terms of detection performance.

Introduction

Change detection in high-resolution remote sensing images has a wide range of applications in tasks such as land use surveys, geographic spatial database updating, and disaster monitoring and assessment, being a main research topic in remote-sensing image processing and analysis (W. Shi et al., Citation2020). For matched bi-temporal remote sensing images, change detection aims to distinguish variations in corresponding pixel pairs. Usually, the areas that have and have not changed are labeled with 1 and 0, respectively (Ke & Zhang, Citation2021).

The supervised change detection method can exactly specify the change areas of interest via annotated samples and generally yields good detection results. Therefore, supervised change detection has been widely studied. From a technical perspective, the methods can be divided into traditional machine-learning methods and deep learning-based methods (He et al., Citation2020): the former rely on handcrafted features, whereas the latter rely on deep features. Traditional machine-learning methods represented by support vector machines and random forests have been widely used for change detection in remote sensing images (Bovolo et al., Citation2008; Feng et al., Citation2021; Negri et al., Citation2021). However, owing to differences in the imaging conditions of multitemporal remote sensing images, it is difficult to distinguish the changes from the background, and directly applying traditional machine learning methods generally yields low detection performance. Moreover, traditional machine learning methods often require complex handcrafted feature extraction to ensure high-performance change detection. Typical extracted features include texture, statistical, and spatial structure features (Bai et al., Citation2014; Celik, Citation2009; Wu et al., Citation2014, Citation2017; Z. Li et al., Citation2017). Such handcrafted features rely on expert knowledge and require setting many hyperparameters according to the data characteristics to ensure high detection performance.

In recent years, with continuous improvements in computing power and data acquisition, deep learning-based methods have greatly improved change detection accuracy. Therefore, deep learning has received extensive attention for change detection in remote sensing images (A. Zhang et al., Citation2020; L. Zhang et al., Citation2016). Unlike traditional machine learning methods, deep learning models can automatically learn to extract abstract features for downstream tasks and achieve higher detection accuracy provided that sufficient labeled training data are available. Deep learning has made great progress in change detection, and most existing methods use a CNN as the backbone for feature extraction. However, the receptive field of a CNN is limited, so these models cannot perceive a wide range of contextual information. On the other hand, self-attention, as embodied in the vision transformer, provides a wider receptive field. Nevertheless, the original transformer cannot balance local and global feature information well and has high computational complexity. In this paper, we propose a change detection method based on a deep hierarchical transformer. To cope with the deficiencies of the original transformer, we limit the self-attention calculation of each transformer layer to local windows with different sizes. To further improve the detection accuracy, we embed features of different scales into a nested U-Net and design a model fusion method. The main contributions of this study can be summarized as follows:

  1. We introduce a hierarchical transformer to better extract features from remote-sensing images. We limit the self-attention calculation to local windows, which makes the model attend to local features. At the same time, we use shifted windows to enable information interaction between different local windows, so that the model can also attend to global features. In this way, the encoder considers both local and global information, thereby improving the change detection accuracy.

  2. To fully use the features extracted at different scales, we concatenate features of different scales and input them into a nested U-Net to complete change detection. Through the convolution layers of the nested U-Net, the extracted feature information can interact. More importantly, the nested structure allows features of different scales to be directly used as inputs of the decoder. Therefore, in the nested U-Net, information between features at different scales can be exchanged, which enables the model to better leverage the features for accurate change detection.

  3. We propose a simple and effective model fusion strategy to improve the final change detection results by fusing the outcomes from augmented image pairs. Experimental results on two public datasets for change detection show that the proposed method can improve change detection, outperforming similar methods.

The remainder of this paper is organized as follows: Section 2 introduces related work. Section 3 describes the proposed method. Section 4 reports the change detection experimental results and the corresponding analyses. Finally, we draw conclusions in Section 5.

Related work

In the following, we give a brief overview of deep learning-based change detection methods, grouped into two main categories: CNN-based methods and transformer-based methods.

CNN-based methods

Among deep learning models, CNNs are well suited to extracting deep features from images. Therefore, most deep learning-based change detection methods are built on CNNs. However, the change detection task takes multi-temporal remote sensing images as input, so existing CNNs from computer vision cannot be used directly. There are two main approaches for applying deep learning to change detection. One approach is to stack the bi-temporal remote sensing images and input them into a fully convolutional network to obtain the change detection results. Classic fully convolutional network models include FCN (Shelhamer et al., Citation2016), U-Net (Ronneberger et al., Citation2015), the DeepLab series (L. C. Chen et al., Citation2017), PSPNet (Zhao et al., Citation2017), and UpperNet (Xiao et al., Citation2018). The other approach is to use a siamese network structure to handle the dual-path inputs for change detection (B. Liu et al., Citation2021; H. Chen et al., Citation2020; Khelifi & Mignotte, Citation2020; Lee et al., Citation2021; Lv et al., Citation2020; M. Zhang et al., Citation2019). For example, a bilateral semantic fusion siamese network was designed to better map bi-temporal images into the semantic feature domain for comparison (Du et al., Citation2022).

The attention mechanism can improve deep learning by discarding irrelevant information and emphasizing areas that are important to the task, thus improving deep learning performance. Therefore, researchers have explored various attention mechanisms for change detection. For instance, H. Chen and Shi (Citation2020) proposed STANet based on spatiotemporal attention for improving the change detection accuracy. They also constructed the publicly available LEVIR-CD large-scale change detection dataset. Similarly, DASNet adopts a dual attention mechanism to improve the change detection accuracy (J. Chen et al., Citation2021). H. Cheng et al. (Citation2021) proposed a hierarchical self-attention augmented Laplacian pyramid-expanding network for highly accurate change detection. In addition, models such as high-resolution networks (Hou et al., Citation2021), recurrent convolutional neural networks (RNNs) (Mou et al., Citation2019; Sun et al., Citation2020), and generative adversarial networks (Peng et al., Citation2021) have also been used in change detection to better identify variations.

Transformer-based methods

CNNs usually have limitations in modeling global dependencies due to the intrinsic locality of convolution operations. The transformer has recently emerged as an alternative architecture for dense prediction tasks owing to the global dependency modeling ability brought by self-attention (Yuan et al., Citation2022). Because the change detection task requires this global dependency modeling ability to improve detection accuracy, researchers have made many meaningful explorations of transformer-based change detection. For example, a transformer encoder-decoder network named BIT was designed to enhance the context information extracted by a CNN and expand the receptive field (H. Chen et al., Citation2022). ChangeFormer unifies a hierarchically structured transformer encoder with a multilayer perceptron (MLP) decoder in a Siamese network architecture to efficiently obtain accurate change detection results (Bandara & Patel, Citation2022). TransUNetCD is an end-to-end encoder-decoder hybrid transformer model for change detection (Q. Li et al., Citation2022). It encodes the tokenized image patches from the convolutional neural network feature map to extract rich global context information; therefore, TransUNetCD has the advantages of both transformers and CNNs. SwinSUNet is a pure transformer network for change detection (C. Zhang et al., Citation2022).

Proposed change detection method

Figure 1 shows a diagram of the proposed change detection method. First, we input bi-temporal remote sensing images into a deep hierarchical transformer to extract abstract features. Then, we concatenate the features extracted from the four stages of the feature extraction backbone network and input them into the four stages of a nested U-Net, which finally provides the change detection results. In this section, we detail the transformer, nested U-Net, model training, and model fusion.

Figure 1. Diagram of proposed change detection method (Concat, concatenation).


Deep hierarchical transformer

In tasks such as image classification, object detection, and semantic segmentation, the quality of the backbone network used for feature extraction greatly affects the model performance (A. Zhang et al., Citation2020; B. Liu et al., Citation2018, Citation2019, Citation2021). Therefore, various feature extraction backbones with excellent performance have been proposed based on CNNs for diverse computer vision tasks. Such CNNs include VGG, ResNet, DenseNet, and HRNet. Using these backbones in change detection may greatly improve its accuracy. However, CNNs have limited receptive fields, impeding the full use of global context information. To address this limitation, we use a transformer as the feature extractor for change detection.

The transformer architecture was originally intended for natural language processing, and its core is the self-attention mechanism that can describe long-range dependencies. We use the transformer to capture global information from images. However, directly applying a global-attention-based transformer to images notably increases the computational complexity. Therefore, we introduce the hierarchical transformer shown in Figure 2 as the feature extraction backbone of the proposed change detection method (Z. Liu et al., Citation2021). The feature extraction backbone can be divided into four stages. Stage 1 starts with patch partition, which divides an image into 4 × 4 non-overlapping patches. As a remote-sensing image contains three bands, after each patch is flattened, the dimension of the one-dimensional feature vector is 48. Then, we use linear embedding to resize the feature vector from 48 dimensions to C dimensions. The transformed feature vectors can be regarded as a sequence, which is input into a transformer block to extract semantic features. Stage 1 comprises two transformer blocks and maintains the dimension (C) and number (H/4 × W/4, where H is the height and W is the width of the feature) of feature vectors. To extract hierarchical features, after stage 1, patch merging is used to aggregate features and reduce the number of feature vectors. Specifically, adjacent 2 × 2 C-dimensional feature vectors are merged into one 4C-dimensional feature vector, reducing the number of feature vectors to H/8 × W/8. Then, a linear embedding resizes the feature vector from 4C to 2C dimensions, and the feature vector sequence is input into the two transformer blocks of stage 2. The output passes through a patch merging layer and then through the six transformer blocks in stage 3. Similarly, the output of stage 3 passes through a patch merging layer and then through the two transformer blocks in stage 4. The four stages provide a hierarchical feature representation, resembling the hierarchical structure of CNNs that expands the receptive field as the network deepens. Hence, the transformer exploits both global and local information.

Figure 2. Diagram of proposed deep hierarchical transformer (C, number of dimensions; H, height; LN, layer normalization; MLP, multilayer perceptron; SW-MSA, shifted-window multi-head self-attention; W, width; W-MSA, window multi-head self-attention).

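For illustration, the following PyTorch sketch shows one possible implementation of the patch partition with linear embedding and of patch merging; the module and parameter names are ours and do not come from the authors' released code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patch partition + linear embedding: split the image into non-overlapping
    4x4 patches (3 bands -> 48 values each) and project every patch to C dimensions."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A 4x4 convolution with stride 4 is equivalent to flattening each 4x4x3
        # patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                          # x: (B, 3, H, W)
        x = self.proj(x)                           # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)        # (B, H/4 * W/4, C) token sequence

class PatchMerging(nn.Module):
    """Merge adjacent 2x2 tokens (C -> 4C), then reduce to 2C, halving the resolution."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                    # x: (B, H*W, C)
        B, _, C = x.shape
        x = x.view(B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)
        return self.reduction(self.norm(x))        # (B, H/2 * W/2, 2C)
```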

The original transformer adopts the global self-attention mechanism shown in Figure 3, which is formulated as follows:

Figure 3. Data processing in various window self-attention mechanisms.


$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$

where Q, K, and V are the query, key, and value matrices, respectively, of dimension $d_k$. Thus, the self-attention calculation relates the feature vector of each image block to the feature vectors of all other image blocks. To extract more representative features, the transformer also adopts a multi-head self-attention mechanism that initializes multiple sets of Q, K, and V; in this study, we used eight sets (heads). To reduce the high computational complexity of the global attention mechanism, as illustrated in Figure 3, we adopt window multi-head self-attention (W-MSA) and constrain the self-attention calculation of each transformer block to a local window. Although this strategy reduces the model complexity, features in different windows cannot communicate with each other, reducing the transformer's ability to describe long-range dependencies. Therefore, we also adopt shifted-window multi-head self-attention (SW-MSA), as illustrated in Figure 3, which shifts the windows toward the lower-right corner, enabling information interaction between different windows.
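A minimal sketch of window-based multi-head self-attention is given below; it implements Eq. (1) within each local window and omits details such as the relative position bias and the attention mask needed at image borders after shifting, so it illustrates the idea rather than reproducing the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)   # (num_windows*B, ws*ws, C)

class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to one local window (Eq. (1) per window)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (num_windows*B, N, C), N = ws*ws
        B_, N, C = x.shape
        qkv = (self.qkv(x)
               .reshape(B_, N, 3, self.num_heads, self.head_dim)
               .permute(2, 0, 3, 1, 4))
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)   # softmax(QK^T / sqrt(d_k))
        out = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(out)

# SW-MSA can be emulated by cyclically shifting the feature map before partitioning,
# e.g. x = torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2)), so that tokens
# from neighbouring windows fall into the same window.
```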

The transformer block shown in Figure 2 can be formulated as follows:

$$\begin{aligned}
\hat{z}^{l} &= \text{W-MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}\\
z^{l} &= \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l}\\
\hat{z}^{l+1} &= \text{SW-MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}\\
z^{l+1} &= \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}
\end{aligned} \tag{2}$$

where $z^{l}$ denotes the output of block $l$ and $\hat{z}^{l}$ the intermediate output of its attention module, MLP is a fully connected layer based on a multilayer perceptron that enhances the description of nonlinearity in the data, and LN denotes layer normalization applied along the channel dimension.
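As a sketch, the residual structure of Eq. (2) corresponds to a block of the following form, instantiated once with a W-MSA attention module and once with an SW-MSA module (the `attn` argument is a placeholder for either):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One sub-block of Eq. (2): pre-norm attention and MLP, each with a residual connection."""
    def __init__(self, dim, attn, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = attn                         # W-MSA or SW-MSA module
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):
        z = self.attn(self.norm1(z)) + z         # z_hat = (S)W-MSA(LN(z)) + z
        z = self.mlp(self.norm2(z)) + z          # z     = MLP(LN(z_hat)) + z_hat
        return z
```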

Nested U-Net

After the deep hierarchical transformer extracts features from the bi-temporal remote-sensing images, the feature maps of the two images obtained from the four stages of the feature extraction backbone are concatenated along the channel dimension. Then, to fully use the feature information among different layers and improve the change detection accuracy, we input the concatenated features into the four stages of the nested U-Net shown in Figure 4 and obtain the change detection results.

Figure 4. Diagram of nested U-Net for change detection.


The main difference between the nested U-Net and the original U-Net is the dense connection between the encoders and decoders. This dense connection allows comprehensive use of feature information at different scales and thereby improves the change detection accuracy.

Let $x^{i,j}$ be the feature map output by node $X^{i,j}$, where $i$ indexes the downsampling layer along the encoder dimension and $j$ indexes the position along the skip-connection dimension. The output feature map of any node is obtained as follows:

$$x^{i,j}=\begin{cases} h\!\left(x^{i-1,j}\right), & j=0\\[4pt] h\!\left(\left[\left[x^{i,k}\right]_{k=0}^{j-1},\ \varphi\!\left(x^{i+1,j-1}\right)\right]\right), & j>0 \end{cases} \tag{3}$$

where $h(\cdot)$ represents applying convolution and ReLU (rectified linear unit) activation to a feature, $\varphi(\cdot)$ represents upsampling, and $[\cdot]$ represents the concatenation of features. A node with $j=0$ accepts only one input from the previous layer of the encoder subnetwork, while a node with $j=1$ accepts inputs from two consecutive layers of the encoder subnetwork. Nodes with $j>1$ accept up to $j+1$ inputs: $j$ outputs from the first $j$ nodes on the current skip-connection path and one upsampled output from the deeper skip-connection path.
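As a structural illustration of Eq. (3), the following sketch computes the output of one node; `ConvUnit` stands in for h(·), the channel dimensions are left unspecified, and the names are hypothetical rather than taken from the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvUnit(nn.Module):
    """h(.) in Eq. (3): a convolution followed by ReLU activation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x))

def nested_node(h, prev_encoder, same_level, deeper):
    """Compute x^{i,j} of Eq. (3).

    prev_encoder : x^{i-1,j}, used when j = 0
    same_level   : list [x^{i,0}, ..., x^{i,j-1}] along the skip-connection path
    deeper       : x^{i+1,j-1}, upsampled by phi(.) when j > 0
    """
    if not same_level:                                        # j = 0: plain encoder node
        return h(prev_encoder)
    up = F.interpolate(deeper, scale_factor=2, mode="bilinear", align_corners=False)
    return h(torch.cat(same_level + [up], dim=1))             # j > 0: concat skips + upsampled node
```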

Model training

Both the focal loss and Dice loss are suitable for imbalanced datasets and can improve the stability and effect of model training. Hence, we combine these functions into the final loss function. The focal loss can be calculated as

$$\mathrm{FL}(p_t) = -\,\eta_t\,(1-p_t)^{\rho}\,\log(p_t), \qquad p_t=\begin{cases} \hat{y}, & \text{if } y=1\\ 1-\hat{y}, & \text{otherwise} \end{cases} \tag{4}$$

where $p_t$ measures how close the prediction $\hat{y}$ is to the ground truth $y$, $\eta_t$ is a weight that controls the contribution of positive and negative samples to the total loss, and $\rho$ is a predefined focusing parameter. The Dice loss can be calculated as

$$d = 1 - \frac{2\,\lvert y \cap \hat{y}\rvert}{\lvert y\rvert + \lvert \hat{y}\rvert} \tag{5}$$

where $y$ is the ground truth, $\hat{y}$ is the prediction of the model, $\lvert y \cap \hat{y}\rvert$ is the intersection between $y$ and $\hat{y}$, and $\lvert y\rvert + \lvert \hat{y}\rvert$ is the sum of their elements.
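A possible PyTorch formulation of the combined loss is sketched below; for simplicity, $\eta_t$ is a single scalar here (in practice it can be set per class), and the Dice term is computed on soft predictions.

```python
import torch

def focal_loss(y_hat, y, eta=1.0, rho=2.0, eps=1e-7):
    """Eq. (4): p_t = y_hat where y = 1, and 1 - y_hat otherwise."""
    p_t = torch.where(y == 1, y_hat, 1.0 - y_hat).clamp(eps, 1.0 - eps)
    return (-eta * (1.0 - p_t) ** rho * torch.log(p_t)).mean()

def dice_loss(y_hat, y, eps=1e-7):
    """Eq. (5): 1 - 2|y ∩ y_hat| / (|y| + |y_hat|), with a small eps for stability."""
    inter = (y * y_hat).sum()
    return 1.0 - (2.0 * inter + eps) / (y.sum() + y_hat.sum() + eps)

def combined_loss(y_hat, y, w_focal=1.0, w_dice=1.0):
    """Weighted sum used for training; both weights are set to 1.0 in this paper."""
    return w_focal * focal_loss(y_hat, y) + w_dice * dice_loss(y_hat, y)
```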

Extensive studies and experimental results have shown that data augmentation helps improve the effectiveness of model training. For training data augmentation, we applied random horizontal and vertical mirroring of the images, random rotation, and random erasing, in which up to four areas per image, each with a maximum width and height of 50 pixels, are randomly erased.
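The augmentation pipeline could be sketched as follows; the limits of four erased areas and a 50-pixel maximum size follow the text, while applying erasing only to the input images and filling with zeros are our assumptions.

```python
import random
import torch

def augment_pair(img1, img2, label, n_erase=4, max_hw=50):
    """Apply the same mirroring/rotation to both images and the label,
    then randomly erase rectangles in the input images.
    img1, img2: (C, H, W) tensors; label: (H, W) tensor."""
    img1, img2, label = img1.clone(), img2.clone(), label.clone()
    if random.random() < 0.5:                                  # horizontal mirror
        img1, img2, label = img1.flip(-1), img2.flip(-1), label.flip(-1)
    if random.random() < 0.5:                                  # vertical mirror
        img1, img2, label = img1.flip(-2), img2.flip(-2), label.flip(-2)
    k = random.randint(0, 3)                                   # random 90-degree rotation
    img1 = torch.rot90(img1, k, (-2, -1))
    img2 = torch.rot90(img2, k, (-2, -1))
    label = torch.rot90(label, k, (-2, -1))
    H, W = label.shape[-2:]
    for img in (img1, img2):                                   # random erasing (images only, assumed)
        for _ in range(random.randint(0, n_erase)):
            h, w = random.randint(1, max_hw), random.randint(1, max_hw)
            top, left = random.randint(0, H - h), random.randint(0, W - w)
            img[:, top:top + h, left:left + w] = 0.0
    return img1, img2, label
```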

Model fusion

To increase the change detection accuracy on test remote-sensing images, as shown in Figure 5, we apply four augmentation operations to each original test image pair: horizontal mirroring, vertical mirroring, 90° rotation, and 270° rotation. The original image pair and the augmented image pairs are fed into the trained model separately to obtain five change detection results. Then, we undo the mirroring and rotation operations on the change detection results of the augmented images. The final change detection result is obtained by a voting rule applied to the five change detection results at every position: if two or more of the five results predict a variation, the final result at that position reflects a change.

Figure 5. Diagram of model fusion for change detection.

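The fusion strategy amounts to test-time augmentation with per-pixel voting, as in the sketch below; the model signature `model(t1, t2)`, assumed to return per-pixel change probabilities, is a placeholder.

```python
import torch

def fused_prediction(model, img1, img2):
    """Run the model on the original pair and four augmented pairs, undo each
    augmentation on the result, and mark a pixel as changed if at least two
    of the five predictions agree."""
    pairs = [
        (lambda x: x,                           lambda y: y),                             # identity
        (lambda x: x.flip(-1),                  lambda y: y.flip(-1)),                    # horizontal mirror
        (lambda x: x.flip(-2),                  lambda y: y.flip(-2)),                    # vertical mirror
        (lambda x: torch.rot90(x, 1, (-2, -1)), lambda y: torch.rot90(y, -1, (-2, -1))),  # 90° rotation
        (lambda x: torch.rot90(x, 3, (-2, -1)), lambda y: torch.rot90(y, -3, (-2, -1))),  # 270° rotation
    ]
    votes = None
    with torch.no_grad():
        for fwd, inv in pairs:
            prob = model(fwd(img1), fwd(img2))      # assumed output: (B, 1, H, W) change probabilities
            pred = (inv(prob) > 0.5).long()         # undo the augmentation, then binarize
            votes = pred if votes is None else votes + pred
    return (votes >= 2).long()                      # voting over the five results
```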

Experimental results and analysis

The hardware environment for the experiments performed in this study included an NVIDIA A100 graphics card, 40 GB video memory, and 256 GB main memory. The software environment was Ubuntu 18.04, and we used the PyTorch library to implement our method.

Datasets

To verify the effectiveness of the proposed change detection method, we used the LEVIR-CD (learning, vision, and remote sensing change detection (H. Chen & Shi, Citation2020)) and SYSU-CD (Sun Yat-Sen University change detection) (Q. Shi et al., Citation2022) large-scale public datasets containing remote sensing images for change detection.

The LEVIR-CD dataset contains 637 remote sensing images of 1024 × 1024 pixels and a resolution of 0.5 m. The dataset includes 31,333 change instances. Following the original splitting method, the numbers of training, validation, and test samples were set to 445, 64, and 128 images, respectively. Due to the video memory limitation, we divided the images into non-overlapping blocks of 512 × 512 pixels. Thus, the numbers of image pairs used for training, validation, and testing were 1780, 256, and 512, respectively.
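For reference, splitting each 1024 × 1024 image into the four non-overlapping 512 × 512 blocks used here can be done with a simple slicing routine such as the one below.

```python
def tile_512(img):
    """Split a (C, 1024, 1024) tensor into four non-overlapping (C, 512, 512) blocks."""
    return [img[:, r:r + 512, c:c + 512] for r in (0, 512) for c in (0, 512)]
```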

The SYSU-CD dataset contains 20,000 aerial remote sensing image pairs of 256 × 256 pixels and a resolution of 0.5 m. Following the original splitting method, the numbers of training, validation, and test samples were set to 12,000, 4000, and 4000, respectively.

Evaluation metrics

To quantitatively evaluate the performance of different change detection methods, we used the F1-score, precision, recall, intersection over union (IoU), and overall classification accuracy (OA) as evaluation metrics. The evaluation metrics are calculated as follows:

$$\begin{aligned}
\mathrm{precision} &= \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})\\
\mathrm{recall} &= \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})\\
\mathrm{IoU} &= \mathrm{TP}/(\mathrm{TP}+\mathrm{FN}+\mathrm{FP})\\
\mathrm{OA} &= (\mathrm{TP}+\mathrm{TN})/(\mathrm{TP}+\mathrm{TN}+\mathrm{FN}+\mathrm{FP})\\
\mathrm{F1} &= (2\times\mathrm{recall}\times\mathrm{precision})/(\mathrm{recall}+\mathrm{precision})
\end{aligned} \tag{6}$$

TP, FN, FP, and TN represent the numbers of true positives, false negatives, false positives, and true negatives, respectively.
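The metrics of Eq. (6) can be computed directly from binary prediction and ground-truth maps, for example as follows.

```python
import numpy as np

def change_metrics(pred, gt):
    """Compute the metrics of Eq. (6) from binary (0/1) prediction and ground-truth maps."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "IoU": tp / (tp + fn + fp),
        "OA": (tp + tn) / (tp + tn + fn + fp),
        "F1": 2 * recall * precision / (recall + precision),
    }
```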

Parameter settings and analysis

We used the Adam optimizer to train the model over 50 epochs. The training was divided into two stages, with the learning rate set to 0.0001 over the first 30 epochs and 0.00001 over the last 20 epochs.
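This two-stage schedule corresponds to a simple MultiStepLR setup in PyTorch, as sketched below; `model` stands for the change detection network.

```python
import torch

def configure_optimizer(model):
    """Adam with the two-stage learning-rate schedule used here:
    1e-4 for the first 30 epochs, 1e-5 for the remaining 20."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)
    return optimizer, scheduler
```

Calling `scheduler.step()` at the end of each epoch switches the learning rate from 1e-4 to 1e-5 after epoch 30.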

As we combined the focal and Dice loss functions, their weights directly affect the change detection accuracy. To study the influence of the two loss functions on the change detection results, we fixed the weight of the Dice loss to 1.0 and evaluated focal loss weights of 0.5, 1.0, 2.0, and 4.0. The F1-scores obtained on the LEVIR-CD and SYSU-CD datasets are shown in Figure 6. When the focal loss weight is relatively large (e.g. 2 or 4), the change detection accuracy decreases. When the focal loss weight is 0.5, the detection accuracy on the LEVIR-CD dataset slightly improves, while that on the SYSU-CD dataset decreases. Therefore, we set the weights of both loss functions to 1.0, which provides balanced performance on both datasets.

Figure 6. Effect of focal loss weight on change detection accuracy.


Comparative analysis

To verify the effectiveness of the proposed method, we compared it with the conventional DeepLabV3, U-Net, PSPNet, and UpperNet for change detection in remote sensing images. In addition, we implemented the latest MSPSNet (Huang et al., Citation2021), ISNet (G. Cheng et al., Citation2022), BIT (H. Chen et al., Citation2022), and ChangeFormer (Bandara & Patel, Citation2022) using their publicly available codes. For a fair comparison, the training, validation, and test sets used for the proposed and comparison methods were the same.

The change detection results of the compared methods on the LEVIR-CD and SYSU-CD datasets are listed in Tables 1 and 2, respectively. For a broader comparison, we also include the results of the proposed method without model fusion. On the LEVIR-CD dataset, the F1-score, precision, recall, and IoU of the proposed method without model fusion are better than those of the other methods. In particular, its precision is higher than that of the similar methods BIT and ChangeFormer. Hence, the proposed method improves the change detection accuracy. When model fusion is added, the proposed method achieves the highest values in all evaluation metrics except OA, further confirming the effectiveness of the proposed fusion strategy. On the SYSU-CD dataset, the F1-score, recall, and IoU of the proposed method with model fusion are the highest, while the precision is lower than that of BIT and the OA is slightly lower than that of ChangeFormer. In general, the proposed method with model fusion achieves the most balanced detection results, demonstrating its effectiveness and high performance.

Table 1. Change detection performance of evaluated methods on LEVIR-CD dataset (%).

Table 2. Change detection performance of evaluated methods on SYSU-CD dataset (%).

To show the results of different methods, we randomly selected five image pairs from each of the LEVIR-CD and SYSU-CD datasets for change detection visualization, as shown in Figures 7 and 8, respectively. The figures also show the image pairs and the corresponding ground truths. Panels (d)–(k) show the change detection results of DeepLabV3, U-Net, PSPNet, UpperNet, MSPSNet, BIT, ChangeFormer, and the proposed method with model fusion, respectively. The white area is the correct detection area (TP), the black area is the background, the red area is the false detection area (FP), and the blue area is the missed detection area (FN). In the second row of Figure 7, the comparison methods show false change detection areas, whereas the proposed method is more accurate. In the fifth row, the comparison methods miss many change areas, whereas the proposed method provides a more complete detection. In Figure 8, the proposed method has fewer false detection and missed detection areas than the comparison methods, further demonstrating the effectiveness of our proposal.


Figure 7. Change detection results on images from the LEVIR-CD dataset. (a) Input image 1, (b) input image 2, (c) ground truth, and change detection results from (d) DeepLabV3, (e) U-Net, (f) PSPNet, (g) UpperNet, (h) MSPSNet, (i) BIT, (j) ChangeFormer, and (k) proposed method.


Figure 8. Change detection results on images from SYSU-CD dataset. (a) Input image 1, (b) input image 2, (c) ground truth, and change detection results from (d) DeepLabV3, (e) U-Net, (f) PSPNet, (g) UpperNet, (h) MSPSNet, (i) BIT, (j) ChangeFormer, and (k) proposed method.

Ablation study

To explore the impact of different data augmentation operations and model fusion on the change detection accuracy, we conducted ablation experiments on the two datasets. The experimental results listed in Table 3 show that the change detection accuracy obtained without data augmentation is the lowest. The two augmentation operations of horizontal/vertical mirroring and random rotation slightly improve the change detection accuracy, whereas random erasing greatly improves it, by more than 0.5% on both datasets. Random erasing fills an area of the image with a constant pixel value, hiding the image information in that area and forcing the model to learn features outside it for recognition; to some extent, this prevents the model from falling into local optima and improves its generalization ability. On top of data augmentation, the proposed model fusion method further improves the change detection accuracy by 0.71% on the LEVIR-CD dataset and by 0.32% on the SYSU-CD dataset. This illustrates the necessity of using data augmentation to train change detection models and demonstrates the effectiveness of model fusion.

Table 3. F1-score (%) obtained from ablation experiments on LEVIR-CD and SYSU-CD datasets (check mark: application of operation or method).

To verify the effectiveness of the deep hierarchical transformer, we tested three models of different scales: Swin-tiny, Swin-small, and Swin-base. The network structures of the three models are shown in Table 4. We used ResNet34, ResNet50, and ResNet101 as benchmarks and applied the model fusion strategy designed in this paper to each model. We also implemented the nested U-Net without any feature extraction backbone. Tables 5 and 6 show the detection results of the different models on the two datasets. First, the detection accuracy of the models using a feature extraction backbone is higher than that of the nested U-Net alone. A feature extraction backbone such as ResNet or Swin can extract more abstract features of different scales from multitemporal remote sensing images, and embedding these features in the nested U-Net makes better use of multi-scale features; therefore, the nested U-Net with a feature extraction backbone improves the change detection accuracy. Comparing the detection accuracy of the different models, the deep hierarchical transformer used in this paper significantly improves the detection accuracy over ResNet, which confirms the effectiveness of the designed method. However, as the number of parameters of the deep hierarchical transformer increases, the detection accuracy tends to decrease, so we finally used the Swin-tiny model as the backbone network for feature extraction. Further comparing the accuracy of the different models before and after applying the proposed fusion method, we find that model fusion improves the change detection accuracy for every model, which fully demonstrates its effectiveness. Finally, we also designed a strategy that fuses the detection results of the three Swin models; the corresponding results on the two datasets are also shown in Tables 5 and 6. The recall is greatly improved after integrating the three models, but the precision decreases significantly and the F1-score does not improve. Considering that training three models takes more time and the accuracy gain is not obvious, we did not adopt this multi-model fusion strategy.

Table 4. The network structures of swin-tiny, swin-small, and swin-base.

Table 5. Change detection performance of evaluated methods on LEVIR-CD dataset (%).

Table 6. Change detection performance of evaluated methods on SYSU-CD dataset (%).

Execution efficiency analysis

To analyze the execution efficiency of the proposed method, we compared our method with two Transformer-based comparison methods (BIT and ChangeFormer).

Table 7 reports the number of parameters (Params), floating-point operations (FLOPs), training time per epoch (Training), and testing time (Testing) of the different methods on both datasets. The Params and FLOPs of the proposed method lie between those of BIT and ChangeFormer. Note that the two comparison methods were trained with their open-source code, which uses a batch size of 8 or 16, whereas the proposed method was optimized to use a larger batch size during training; therefore, the training time of the proposed method is shorter. The three methods employ the same batch size during testing, but the proposed method includes the model fusion strategy, which slightly increases the testing time. In general, the proposed method does not significantly increase the number of parameters or the complexity of the model while ensuring the accuracy of change detection.

Table 7. Execution efficiency of evaluated methods on LEVIR-CD and SYSU-CD datasets.

Conclusions

We propose a hierarchical transformer to improve change detection in high-resolution remote sensing images. Quantitative and qualitative experimental results on two change detection datasets, LEVIR-CD and SYSU-CD, show that the proposed hierarchical transformer for feature extraction combined with a nested U-Net can outperform conventional methods. To further improve the change detection accuracy, we adopt a model fusion method. The experimental results verify that fusion increases the change detection accuracy.

Disclosure statement

No potential conflict of interest was reported by the authors.

Data availability statement

The data employed in this paper include the LEVIR-CD and SYSU-CD datasets. LEVIR-CD is available at https://justchenhao.github.io/LEVIR/, and SYSU-CD is available at https://hub.fastgit.org/liumency/SYSU-CD.

Additional information

Funding

The work was supported by the Natural Science Foundation of Henan [41201477].

References

  • Bai, X., Zhang, H., & Zhou, J. (2014). VHR object detection based on structural feature extraction and query expansion. IEEE Transactions on Geoscience and Remote Sensing, 52(10), 6508–13. https://doi.org/10.1109/TGRS.2013.2296782
  • Bandara, W. G. C., & Patel, V. M. (2022). A transformer-based siamese network for change detection. IEEE International Geoscience and Remote Sensing Symposium(IGARSS), 17-22 July 2022. https://doi.org/10.1109/IGARSS46834.2022.9883686.
  • Bovolo, F., Bruzzone, L., & Marconcini, M. (2008). A novel approach to unsupervised change detection based on a semisupervised SVM and a similarity measure. IEEE Transactions on Geoscience and Remote Sensing, 46(7), 2070–2082. https://doi.org/10.1109/TGRS.2008.916643
  • Celik, T. (2009). Unsupervised change detection in satellite images using principal component analysis and k-means clustering. IEEE Geoscience and Remote Sensing Letters, 6(4), 772–776. https://doi.org/10.1109/LGRS.2009.2025059
  • Cheng, G., Wang, G., & Han, J. (2022). ISNet: Towards improving separability for remote sensing image change detection. IEEE Transactions on Geoscience and Remote Sensing, 60(5623811), 1–11. https://doi.org/10.1109/TGRS.2022.3174276
  • Cheng, H., Wu, H., Zheng, J., Qi, K., & Liu, W. (2021). A hierarchical self-attention augmented Laplacian pyramid expanding network for change detection in high-resolution remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing, 182, 52–66. https://doi.org/10.1016/j.isprsjprs.2021.10.001
  • Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848. https://doi.org/10.1109/TPAMI.2017.2699184
  • Chen, H., Qi, Z., & Shi, Z. (2022). Remote sensing image change detection with transformers. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–14. https://doi.org/10.1109/TGRS.2021.3095166
  • Chen, H., & Shi, Z. (2020). A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sensing, 12(10), 1–23. https://doi.org/10.3390/rs12101662
  • Chen, H., Wu, C., Du, B., Zhang, L., & Wang, L. (2020). Change detection in multisource VHR images via deep siamese convolutional multiple-layers recurrent neural network. IEEE Transactions on Geoscience and Remote Sensing, 58(4), 2848–2864. https://doi.org/10.1109/TGRS.2019.2956756
  • Chen, J., Yuan, Z., Peng, J., Chen, L., Huang, H., Zhu, J., Liu, Y., & Li, H. (2021). Dasnet: Dual attentive fully convolutional siamese networks for change detection in high-resolution satellite images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 1194–1206. https://doi.org/10.1109/JSTARS.2020.3037893
  • Du, H., Zhuang, Y., Dong, S., Li, C., Chen, H., Zhao, B., & Chen, L. Bilateral semantic fusion siamese network for change detection from multitemporal optical remote sensing imagery. (2022). IEEE Geoscience and Remote Sensing Letters, 19(6003405), 1–5. Art no. https://doi.org/10.1109/LGRS.2021.3082630
  • Feng, X., Li, P., & Cheng, T. (2021). Detection of urban built-up area change from Sentinel-2 images using multiband temporal texture and one-class random forest. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 6974–6986. https://doi.org/10.1109/JSTARS.2021.3092064
  • He, D., Zhong, Y., & Zhang, L. (2020). Spectral–spatial–temporal MAP-based sub-pixel mapping for land-cover change detection. IEEE Transactions on Geoscience and Remote Sensing, 58(3), 1696–1717. https://doi.org/10.1109/TGRS.2019.2947708
  • Hou, X., Bai, Y., Li, Y., Shang, C., & Shen, Q. (2021). High-resolution triplet network with dynamic multiscale feature for change detection on satellite images. ISPRS Journal of Photogrammetry and Remote Sensing, 177, 103–115. https://doi.org/10.1016/j.isprsjprs.2021.05.001
  • Huang, J., Shen, Q., Wang, M., & Yang, M. (2021). Multiple attention siamese network for high-resolution image change detection. IEEE Transactions on Geoscience and Remote Sensing. https://doi.org/10.1109/TGRS.2021.3127580.
  • Ke, Q., & Zhang, P. (2021). CS-HSNet: A cross-siamese change detection network based on hierarchical-split attention. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 9987–10002. https://doi.org/10.1109/JSTARS.2021.3113831
  • Khelifi, L., & Mignotte, M. (2020). Deep learning for change detection in remote sensing images: Comprehensive review and meta-analysis. IEEE Access, 8, 126385–126400. https://doi.org/10.1109/ACCESS.2020.3008036
  • Lee, H., Lee, K., Kim, J. H., Na, Y., Park, J., Choi, J. P., & Hwang, J. Y. (2021). Local similarity siamese network for urban land change detection on remote sensing images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 4139–4149. https://doi.org/10.1109/JSTARS.2021.3069242
  • Li, Z., Shi, W., Zhang, H., & Hao, M. (2017). Change detection based on Gabor wavelet features for very high resolution remote sensing images. IEEE Geoscience and Remote Sensing Letters, 14(5), 783–787. https://doi.org/10.1109/LGRS.2017.2681198
  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., & Yang, X. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10012–10022. https://doi.org/10.1109/ICCV48922.2021.00986.
  • Liu, B., Yu, A., Yu, X., Wang, R., Gao, K., & Guo, W. (2021). Deep multiview learning for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 59(9), 7758–7772. https://doi.org/10.1109/TGRS.2020.3034133
  • Liu, B., Yu, X., Yu, A., Zhang, P., Wan, G., & Wang, R. (2019). Deep few-shot learning for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 57(4), 2290–2304. https://doi.org/10.1109/TGRS.2018.2872830
  • Liu, B., Yu, X., Zhang, P., Yu, A., Fu, Q., & Wei, X. (2018). Supervised deep feature extraction for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(4), 1909–1921. https://doi.org/10.1109/TGRS.2017.2769673
  • Li, Q., Zhong, R., Du, X., & Du, Y. TransUNetCD: A hybrid transformer network for change detection in optical remote-sensing images. (2022). IEEE Transactions on Geoscience and Remote Sensing, 60(5622519), 1–19. Art no. https://doi.org/10.1109/TGRS.2022.3169479
  • Lv, Z., Liu, T., Kong, X., Shi, C., & Benediktsson, J. A. (2020). Landslide inventory mapping with bitemporal aerial remote sensing images based on the dual-path fully convolutional network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13, 4575–4584. https://doi.org/10.1109/JSTARS.2020.2980895
  • Mou, L., Bruzzone, L., & Zhu, X. X. (2019). Learning spectral-spatial-temporal features via a recurrent convolutional neural network for change detection in multispectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 57(2), 924–935. https://doi.org/10.1109/TGRS.2018.2863224
  • Negri, R. G., Frery, A. C., Casaca, W., Azevedo, S., Dias, M. A., Silva, E. A., & Alcântara, E. H. (2021). Spectral–spatial-aware unsupervised change detection with stochastic distances and support vector machines. IEEE Transactions on Geoscience and Remote Sensing, 59(4), 2863–2876. https://doi.org/10.1109/TGRS.2020.3009483
  • Peng, D., Bruzzone, L., Zhang, Y., Guan, H., Ding, H., & Huang, X. (2021). SemiCDNet: A semisupervised convolutional neural network for change detection in high resolution remote-sensing images. IEEE Transactions on Geoscience and Remote Sensing, 59(7), 5891–5906. https://doi.org/10.1109/TGRS.2020.3011913
  • Ronneberger, O., Fischer, P., & Brox, T. (2015).U-Net: Convolutional networks for biomedical image segmentation. in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. https://doi.org/10.1007/978-3-319-24574-4_28.
  • Shelhamer, E., Long, J., & Darrell, T. (2016). Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 640–651. https://doi.org/10.1109/TPAMI.2016.2572683
  • Shi, Q., Liu, M., Li, S., Liu, X., Wang, F., & Zhang, L. (2022). A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–16. https://doi.org/10.1109/TGRS.2021.3085870
  • Shi, W., Zhang, M., Zhang, R., Chen, S., & Zhan, Z. (2020). Change detection based on artificial intelligence: State-of-the-art and challenges. Remote Sensing, 12(10), 1–35. https://doi.org/10.3390/rs12101688
  • Sun, S., Mu, L., Wang, L., & Liu, P. (2020). L-UNet: An LSTM network for remote sensing image change detection. IEEE Geoscience and Remote Sensing Letters, 19, 1–5. https://doi.org/10.1109/LGRS.2020.3041530
  • Wu, C., Du, B., & Zhang, L. (2014). Slow feature analysis for change detection in multispectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 52(5), 2858–2874. https://doi.org/10.1109/TGRS.2013.2266673
  • Wu, C., Zhang, L., & Du, B. (2017). Kernel slow feature analysis for scene change detection. IEEE Transactions on Geoscience and Remote Sensing, 55(4), 2367–2384. https://doi.org/10.1109/TGRS.2016.2642125
  • Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified perceptual parsing for scene understanding. in European Conference on Computer Vision, pp. 418–434. https://doi.org/10.48550/arXiv.1807.10221.
  • Yuan, J., Wang, L., & Cheng, S. (2022). STransUNet: A siamese transUNet-based remote sensing image change detection network. in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 9241–9253, https://doi.org/10.1109/JSTARS.2022.3217038.
  • Zhang, C., Wang, L., Cheng, S., & Li, Y. SwinSUNet: Pure transformer network for remote sensing image change detection. (2022). IEEE Transactions on Geoscience and Remote Sensing, 60(5224713), 1–13. Art no. https://doi.org/10.1109/TGRS.2022.3160007
  • Zhang, M., Xu, G., Chen, K., Yan, M., & Sun, X. (2019). Triplet-based semantic relation learning for aerial remote sensing image change detection. IEEE Geoscience and Remote Sensing Letters, 16(2), 266–270. https://doi.org/10.1109/LGRS.2018.2869608
  • Zhang, A., Yue, P., Tapete, D., Jiang, L., Shangguan, B., Huang, L., & Liu, G. (2020). A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing, 166, 183–200. https://doi.org/10.1016/j.isprsjprs.2020.06.003
  • Zhang, L., Zhang, L., & Du, B. (2016). Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Magazine, 4(2), 22–40. https://doi.org/10.1109/MGRS.2016.2540798
  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6230–6239. https://doi.org/10.1109/CVPR.2017.660.