Research Article

A dual-difference change detection network for detecting building changes on high-resolution remote sensing images

Article: 2322080 | Received 11 Dec 2023, Accepted 16 Feb 2024, Published online: 13 Mar 2024

Abstract

Existing deep learning-based change detection networks encounter challenges related to the temporal dependency inherent in dual-temporal images. In this study, a weight-shared dual-difference change detection network (DDCDNet) model is proposed based on feature extraction networks. The model employs feature discrimination modules fused with spatial and channel attention mechanisms at different hierarchical levels of the backbone network. In the encoding phase, the tiny version of the Swin Transformer is utilized as the backbone network, with a weight-sharing strategy applied to extract feature information from bi-temporal remote sensing images. The proposed model is evaluated on the LEVIR-CD + and DSIFN datasets, achieving F1-scores (F1) of 87.71% and 85.79%, recalls of 83.87% and 81.17%, and IoU (Intersection over Union) scores of 78.11% and 75.12%, respectively. These results indicate that the proposed model significantly outperforms the comparative models, demonstrating a stronger capability to identify temporal changes in buildings and excellent generalization capability.

1. Introduction

Change detection is the process of identifying differences in the state of objects or phenomena through observations at different times (Shi et al. Citation2020). Nowadays, it has been widely applied in various domains such as land resource planning, urban planning, construction land supervision, and natural disaster monitoring and assessment (Song et al. Citation2014; Hölbling et al. Citation2015; Feizizadeh et al. Citation2017). Timely and accurate understanding of surface cover changes plays a crucial role in national economic development, social progress, and ecological preservation. Buildings are the primary location of human activities and represent typical artificial geographic targets, and their accurate change detection holds profound significance for rational land resource utilization (Li et al. Citation2019), urban development (Chen et al. Citation2021), and improving the well-being of residents (Chen et al. Citation2021). Therefore, detecting changes in buildings has emerged as a prominent research topic in the fields of remote sensing, measurement (Wen et al. Citation2021), and artificial intelligence.

Over the past few decades, researchers have proposed numerous change detection algorithms, which have evolved from traditional pixel-based methods to deep learning-based approaches (Zheng et al. Citation2022). Early methods primarily relied on image processing techniques such as threshold segmentation and morphological operations. These methods were simple and effective, but they were sensitive to image noise and prone to false detections. Subsequently, feature-based approaches were proposed, which enhance the representation of building changes by extracting features such as gradients and textures from images, thereby further improving the accuracy of building change detection. Meanwhile, change detection methods based on multispectral features (Du et al. Citation2024; Wu et al. Citation2022; Eftekhari et al. Citation2023; Basavaraju et al. Citation2022) have been extensively utilized, demonstrating efficacy in alleviating the loss of image details (Zhou et al. Citation2022). However, owing to the various complex change types and noise interference in images, as well as the inherently uncertain, poorly defined edges of buildings, building change detection remains a challenging problem.

In recent years, deep learning has achieved revolutionary breakthroughs in the field of computer vision. In particular, convolutional neural networks (CNNs) (Kattenborn et al. Citation2021) have demonstrated outstanding performance in tasks such as image classification and object detection (Yang et al. Citation2023). Therefore, many researchers have introduced deep learning to the change detection of buildings. For instance, CosimNet (Guo et al. Citation2018) introduced fully convolutional siamese metric networks for scene change detection, and two fully convolutional siamese network structures with skip connections, FC-Siam-Conc and FC-Siam-Diff (Daudt et al. Citation2018), were proposed for the first time. Building upon DenseNet (Huang et al. Citation2016), SNUNet-CD (Fang et al. Citation2022) combined siamese networks with UNet++ (Zhou et al. Citation2019) to propose a densely connected siamese network for detecting building changes. In addition, methods based on multi-level feature fusion (Chen et al. Citation2021; Peng et al. Citation2021; Xu et al. Citation2023) have also become the focus of many researchers. For instance, BCDetNet (Basavaraju et al. Citation2023) utilizes multiple feature extraction modules to detect small change regions. RSCDNet (Barkur et al. Citation2023), which improves the context feature fusion approach, achieves pixel-level change detection in dual-temporal high-resolution remote sensing (HRRS) images and exhibits promising performance. Nevertheless, these methodologies often face challenges in effectively harnessing the abundant feature information present at different times to discriminate the underlying differences. Moreover, they are vulnerable to data imbalance issues (Morgan et al. Citation2022), which ultimately degrades the model’s generalization capability.

To address the aforementioned issues, a deep learning-based dual-difference architecture for building change detection is proposed. Firstly, we employ an adaptive weight allocation strategy to dynamically adjust the weights of the loss function based on the complexity of local image features, thereby mitigating the impact of data imbalance on the model. Secondly, different feature difference modules are utilized for the feature maps from different levels of the backbone network output, aiming to fully exploit the feature information within the maps. Furthermore, a multi-level feature fusion strategy is designed to extract multi-scale features of building changes effectively by introducing cross-scale contextual information. Through experiments conducted on multiple remote sensing image datasets, we demonstrate the superior performance of the proposed method in building change detection tasks. Therefore, this paper aims to propose a deep learning-based change detection algorithm to enhance the accuracy and applicability of change detection.

The main contributions of this paper are as follows:

  1. A novel building change detection algorithm based on siamese networks is proposed for high-resolution remote sensing images of buildings. This method effectively addresses the issue of uncertain edge changes in buildings by fully leveraging the rich features inherent in remote sensing images and integrating low-level detailed information with high-level semantic information.

  2. A feature processing module is designed during the encoding phase to address the differences in channel numbers and feature map sizes between low-level and high-level features. This module strengthens the network’s ability to extract local information features and enhances context modeling, thereby reducing noise interference and improving the overall performance and efficiency of the model.

  3. Detailed experiments and visual analysis are conducted on multiple datasets for building change detection, validating the superior performance and robustness of the proposed method in the task of detecting changes in buildings.

The remaining sections of this paper are organized as follows: Section 2 elaborates on the detailed structure of the proposed building change detection model, DDCDNet, including differential modules employing different strategies and a multi-level feature fusion strategy. Section 3 presents the experimental setup and analysis of the experimental results. Section 4 provides an analysis and discussion of the results. Finally, Section 5 concludes the paper and outlines future research directions.

2. Methods

In existing remote sensing image-based methods for building change detection, challenges persist in fully leveraging the differential information between non-simultaneous features in dual-temporal images, as well as in discerning uncertain edge changes in buildings. Therefore, a change detection model utilizing a dual-difference module is proposed in this paper for efficient extraction of feature information from remote sensing images of buildings.

To reduce overfitting and obtain an optimized model, data augmentation techniques such as random rotation, vertical flipping, and random scale cropping are applied to the training dataset. Then, the dual-temporal building remote sensing images at time points 1 and 2 are used as inputs for the model training to extract deep-level features, followed by a decoder to obtain the change detection results. Finally, comparative experiments are conducted on different datasets, and detailed visual analysis is performed to evaluate the superiority of the proposed model in this paper.

2.1. The overall architecture of the network

The pre-change and post-change dual-temporal remote sensing images in this paper are used as inputs to the DDCDNet model, which outputs a binary classification change detection map. The overall architecture of the DDCDNet change detection network model, as shown in Figure 1, consists of the following core components: a weight-shared Swin-tiny backbone network for feature extraction, a spatial difference module, a dilated convolution pooling difference module, and a multi-level feature fusion module. The network structure follows an Encoder-Decoder (Cho et al. Citation2014) architecture.

Figure 1. The structure of DDCDNet.

The registered dual-temporal remote sensing images are first inputted into the weight-shared encoder. In this paper, the Swin Transformer (Liu et al. Citation2021) tiny version is employed to construct dual encoders that separately process the pre-change and post-change dual-temporal images, enabling multi-scale semantic feature extraction. Different discrepancy modules are used to compute the differences in multi-level features between the pre-change and post-change images from the Swin Transformer encoder. The discrepancy modules proposed in this paper go beyond a mere calculation of the absolute difference between $F_{pre}^{i}$ and $F_{post}^{i}$. Instead, they dynamically learn the optimal distance measure for each scale during the training process, contributing to enhanced performance in building change detection. The change feature information of the shallow stages, namely stage 1 and stage 2, is processed by the difference1 module, while the change feature information of the deep stages, namely stage 3 and stage 4, is processed by the difference2 module. This differential feature extraction method effectively utilizes the feature information at different levels. By extracting features for each scale of the images, multi-scale feature representations are generated, and all levels of feature maps possess strong semantic information. The obtained features $F_1$, $F_2$, and $F_3$ are then convolved with a 3 × 3 kernel and combined with $F_4$, which are jointly fed into the multi-level feature fusion module. This fusion process integrates both high-level features and low-level features.
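As an illustrative sketch only (not the authors' released implementation), the data flow above can be written in PyTorch roughly as follows; the sub-modules are placeholders to be supplied, and the channel widths are assumptions not specified in the paper.

```python
import torch.nn as nn

class DDCDNetSketch(nn.Module):
    """Minimal sketch of the DDCDNet data flow described above; all sub-modules
    are placeholders (e.g. as sketched in the following sections)."""
    def __init__(self, encoder, diff1_modules, diff2_modules, mlff, head,
                 mid_channels=(96, 192, 384)):  # placeholder widths, must match the difference-module outputs
        super().__init__()
        self.encoder = encoder                      # weight-shared Swin-tiny backbone returning 4 stage outputs
        self.diff1 = nn.ModuleList(diff1_modules)   # spatial difference modules (stages 1-2)
        self.diff2 = nn.ModuleList(diff2_modules)   # dilated-conv pooling difference modules (stages 3-4)
        self.convs = nn.ModuleList([nn.Conv2d(c, c, 3, padding=1) for c in mid_channels])
        self.mlff = mlff                            # multi-level feature fusion module
        self.head = head                            # decoder producing the binary change map

    def forward(self, img_pre, img_post):
        # The same encoder (shared weights) processes both temporal images.
        f_pre, f_post = self.encoder(img_pre), self.encoder(img_post)

        # Shallow stages use difference1 (which also looks at the next deeper stage, Eq. 1);
        # deep stages use difference2.
        f1 = self.diff1[0](f_pre[0], f_post[0], f_pre[1], f_post[1])
        f2 = self.diff1[1](f_pre[1], f_post[1], f_pre[2], f_post[2])
        f3 = self.diff2[0](f_pre[2], f_post[2])
        f4 = self.diff2[1](f_pre[3], f_post[3])

        # F1-F3 pass through a 3 x 3 convolution before being fused with F4.
        f1, f2, f3 = (conv(f) for conv, f in zip(self.convs, (f1, f2, f3)))
        return self.head(self.mlff([f1, f2, f3], f4))
```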

The following sections provide a detailed introduction to the four core modules used for change detection in the encoder-decoder structure: the weight-shared Swin-tiny backbone network for feature extraction, spatial difference module, dilated convolution pooling difference module, and multi-level feature fusion module.

2.2. The backbone network for feature extraction is based on the weight-sharing Swin-transformer approach

Traditional change detection models often employ CNNs as the backbone network, which primarily extract features from images through convolutional operations. However, due to fixed convolutional kernels, CNNs can only capture local information and struggle to effectively process global information. Furthermore, CNNs suffer from issues such as gradient vanishing and overfitting. In contrast, Transformers (Strudel et al. Citation2021) utilize self-attention mechanisms to capture relationships within sequences, enabling simultaneous consideration of both global and local information. Additionally, Transformers possess advantages such as parallelizability and high computational efficiency. Notably, the Swin Transformer, as a Transformer-based network structure, has achieved remarkable performance in the field of computer vision.

The Swin Transformer utilizes a hierarchical attention mechanism to capture features at different scales (Figure 2). Compared to traditional Transformers, the Swin Transformer introduces a local window mechanism that applies attention mechanisms to local regions, reducing computational complexity. Additionally, the Swin Transformer employs a multi-stage strategy to balance computational efficiency and accuracy. Furthermore, it adopts a weight-sharing strategy, reducing the number of parameters during network training, thereby improving the model’s generalization capability.

Figure 2. Two successive Swin Transformer blocks.

The entire model follows a hierarchical design, comprising a total of 4 stages. With the exception of the first stage, each stage undergoes a Patch Merging layer to reduce the resolution of the input feature map, thereby performing downsampling operations. This process is analogous to the progressive enlargement of receptive fields in CNNs, enabling the acquisition of global information.

Within each of these stages, the Patch Merging layer is applied together with the Swin Transformer Block operations and performs the downsampling. In the last stage, no further downsampling is required; instead, a subsequent fully connected layer is utilized to compute the loss in conjunction with the target label.
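For reference, a generic Swin-style Patch Merging layer, which groups each 2 × 2 neighborhood and halves the spatial resolution while doubling the channel dimension, can be sketched as follows (a minimal re-implementation, not the exact code used in this paper).

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style patch merging: each 2x2 block of tokens is concatenated and
    linearly projected, halving H and W and doubling C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):
        # x: (B, H, W, C) feature map from the previous stage (H and W assumed even)
        x0 = x[:, 0::2, 0::2, :]   # top-left token of every 2x2 block
        x1 = x[:, 1::2, 0::2, :]   # bottom-left
        x2 = x[:, 0::2, 1::2, :]   # top-right
        x3 = x[:, 1::2, 1::2, :]   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)

# Example: a Swin-tiny stage-1 map of size 64x64 with 96 channels becomes 32x32 with 192 channels.
# PatchMerging(96)(torch.randn(1, 64, 64, 96)).shape -> torch.Size([1, 32, 32, 192])
```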

2.3. The spatial difference module (difference1 module)

In this module (Figure 3), the features $F_{i+1}$ are concatenated with $F_i$ through a cascading operation. This process involves extracting features from multiple scales of feature maps and concatenating them to effectively handle objects of different scales. By incorporating these cascaded feature maps, the model can better capture and represent objects at various scales.

Figure 3. The structure of difference1 module.

By employing the Swin Transformer as the backbone network, the outputs of stage 1 and stage 2 exhibit larger feature map sizes and contain minimal semantic information. However, they are capable of capturing local features within the image.

In stage 1 and stage 2, the feature representation is enhanced through the utilization of a Spatial Attention Module (Woo et al. Citation2018; Yang et al. Citation2022). This module operates by first obtaining query, key, and value tensors through three convolutional layers applied to the input feature map. Next, the query and key tensors undergo a dot product operation, resulting in a similarity matrix. The similarity matrix is then subjected to a softmax operation to obtain the weights for each position. Finally, the weighted feature representation is obtained by multiplying the weights with the corresponding values. By computing the weights for each position in the feature map, the model can capture different feature information at various stages, thereby improving its performance. The fused features at level $i$ are computed as

(1) $F_i = \mathrm{Add}\!\left(\mathrm{Concat}(F_{pre}^{i}, F_{post}^{i}),\ \mathrm{Upsample}_{bilinear}\!\left(\mathrm{Concat}(F_{pre}^{i+1}, F_{post}^{i+1})\right)\right)$

The fused $F_i$ serves as the input to the spatial attention module. The advantage of the Spatial Attention Module lies in its ability to learn the importance of each position, thereby effectively capturing key features within the image. In the context of change detection tasks, these key features play a crucial role in distinguishing different classes of pixels. By incorporating the Spatial Attention Module, the model can better discriminate between different classes of pixels. Consequently, in stage 1 and stage 2, the Spatial Attention Module is applied to local regions, with a particular emphasis on considering neighboring pixels. This approach enables the module to capture local feature information more effectively.
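A minimal sketch of the difference1 module, assuming PyTorch and combining Eq. (1) with the query/key/value spatial attention described above, could look like the following; the 1 × 1 projection used to match channel counts before the element-wise addition and the query/key width are assumptions not stated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Difference1(nn.Module):
    """Sketch of the spatial-difference module: fuse two levels of bi-temporal
    features (Eq. 1) and apply a query/key/value spatial attention."""
    def __init__(self, ch, ch_next):
        super().__init__()
        # 1x1 projection so the upsampled deeper-level concat can be added
        # element-wise to the current level (an assumption of this sketch).
        self.proj = nn.Conv2d(2 * ch_next, 2 * ch, kernel_size=1)
        self.query = nn.Conv2d(2 * ch, ch // 2, kernel_size=1)
        self.key = nn.Conv2d(2 * ch, ch // 2, kernel_size=1)
        self.value = nn.Conv2d(2 * ch, 2 * ch, kernel_size=1)

    def forward(self, f_pre, f_post, f_pre_next, f_post_next):
        # Eq. (1): concatenate bi-temporal features at both levels, upsample the
        # deeper level bilinearly, and add the two.
        cur = torch.cat([f_pre, f_post], dim=1)
        deep = torch.cat([f_pre_next, f_post_next], dim=1)
        deep = F.interpolate(deep, size=cur.shape[-2:], mode="bilinear", align_corners=False)
        fi = cur + self.proj(deep)

        # Spatial attention: dot-product similarity between every pair of positions.
        b, c, h, w = fi.shape
        q = self.query(fi).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.key(fi).flatten(2)                     # (B, C', HW)
        v = self.value(fi).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = torch.softmax(q @ k, dim=-1)             # (B, HW, HW) position weights
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out                                      # weighted feature representation
```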

2.4. The dilated convolution pooling difference module (difference2 module)

To better capture global feature information, the difference2 module (Figure 4) is employed to process stage 3 and stage 4 of the Swin Transformer backbone network.

(2) $\hat{F}_i = \mathrm{Concat}(F_{pre}^{i}, F_{post}^{i})$

$\hat{F}_i$ serves as the input to the ASPP (Atrous Spatial Pyramid Pooling) (Chen et al. Citation2018) module. The ASPP module is employed to extract features at different scales. It achieves this by utilizing multiple dilated convolutions (Li and Wang Citation2022), which insert holes within the convolutional kernels to expand their receptive fields, thereby capturing a broader range of contextual information. By employing dilated convolutions with varying dilation rates, the ASPP module effectively handles feature information at different scales. This enables the model to better understand objects and scenes of different scales within the image, thus improving its performance without sacrificing resolution. By incorporating the ASPP module, the model can increase its receptive field and enhance its capability to capture diverse contextual information.

Figure 4. The structure of difference2 module.

During the feature extraction process, downsampling of the input image is often performed, leading to information loss. However, by incorporating the ASPP module, it becomes possible to extract features at different scales without reducing the resolution. This mitigates the loss of information and allows for the preservation of important details. The ASPP module effectively addresses this concern by enabling the extraction of multi-scale feature information, thereby minimizing information loss while maintaining the original resolution of the image.

The ASPP module is employed in conjunction with channel attention (Khotimah et al. Citation2022) to enhance the model’s adaptability to different scenes and lighting conditions, thereby improving its robustness. This approach increases the receptive field, enabling the acquisition of a broader range of contextual information. It facilitates the model’s understanding of objects and scenes at various scales within the image. By adaptively adjusting the weights of each channel based on their significance, the module effectively handles feature information at different scales, allowing the model to capture features of varying scales more effectively. The channel attention module aids in capturing relationships among different channels, thereby enhancing the model’s feature representation capabilities. It assists in better understanding the relationships among different regions in aerial images of buildings. Additionally, the adaptive adjustment of channel weights reduces the number of model parameters, enhancing efficiency and generalization ability.
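A minimal sketch of the difference2 module, assuming PyTorch, an ASPP block with commonly used dilation rates, and a squeeze-and-excitation style channel attention (the exact rates, output width, and reduction ratio are assumptions of this sketch), might look like this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Difference2(nn.Module):
    """Sketch of the dilated-convolution pooling difference module: concatenate the
    bi-temporal deep features (Eq. 2), apply an ASPP block, and reweight channels."""
    def __init__(self, ch, out_ch=256, rates=(1, 6, 12, 18), reduction=16):
        super().__init__()
        in_ch = 2 * ch  # channels after concatenating pre- and post-change features
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 1) if r == 1
            else nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates
        ])
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)
        # Channel attention (squeeze-and-excitation style, an assumed implementation).
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, 1), nn.Sigmoid(),
        )

    def forward(self, f_pre, f_post):
        x = torch.cat([f_pre, f_post], dim=1)                      # Eq. (2)
        feats = [branch(x) for branch in self.branches]            # dilated-conv branches
        pooled = F.interpolate(self.image_pool(x), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        y = self.project(torch.cat(feats + [pooled], dim=1))       # fused multi-scale features
        return y * self.se(y)                                      # channel-wise reweighting
```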

2.5. The multi-level feature fusion (MLFF) module

The Multi-Level Feature Fusion (MLFF) module effectively integrates features from different levels of the feature hierarchy. In this study, a design is proposed that utilizes a series of convolutional layers with skip connections to merge multi-scale feature maps with lower-level features. The skip connections are introduced to preserve the spatial information of the lower-level features. Specifically, the lower-level feature $F_{diff2}$ is concatenated with the output of the first convolutional layer in the fusion module. Subsequently, a series of convolutional layers is applied to further fuse the features and reduce the spatial resolution. This approach enables the combination of detailed information from lower levels with semantic information from higher levels.

The Multi-Level Feature Fusion (MLFF) module efficiently integrates information from different scales and levels, thereby enhancing the performance of feature extraction. By combining features from various scales and levels, the MLFF module effectively captures both local and global information, allowing for a more comprehensive representation of the input data. This integration of multi-scale and multi-level information contributes to improved feature extraction capabilities, leading to enhanced performance in various tasks.
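A minimal sketch of such a fusion module, assuming PyTorch and hypothetical channel widths, is shown below; the skip input stands in for the lower-level feature $F_{diff2}$ described above, and a decoder head producing the change map would follow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLFF(nn.Module):
    """Sketch of the multi-level feature fusion module: upsample all levels to a
    common resolution, fuse them with a first convolution, re-inject a lower-level
    feature through a skip connection, and refine with further convolutions."""
    def __init__(self, in_channels=(192, 384, 768), skip_channels=256, mid_ch=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(sum(in_channels), mid_ch, 3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
        )
        self.refine = nn.Sequential(
            nn.Conv2d(mid_ch + skip_channels, mid_ch, 3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
        )

    def forward(self, features, skip):
        # Bring every level to the resolution of the largest (shallowest) feature map.
        size = features[0].shape[-2:]
        ups = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
               for f in features]
        x = self.fuse(torch.cat(ups, dim=1))             # first convolution over all levels
        skip = F.interpolate(skip, size=size, mode="bilinear", align_corners=False)
        x = self.refine(torch.cat([x, skip], dim=1))     # skip connection with the lower-level feature
        return x                                         # fused features for the decoder head
```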

3. Results

3.1. Experimental data

To thoroughly validate the effectiveness and robustness of the proposed model in this paper, experiments are conducted on two widely used remote sensing image change detection datasets, namely LEVIR-CD+ (Chen and Shi Citation2020) and DSIFN (Zhang et al. Citation2020). These datasets are selected specifically to address various challenges encountered in change detection tasks. The following provides a detailed description of the data:

  1. LEVIR-CD + dataset

    The LEVIR-CD + dataset is a large-scale dataset specifically designed for building change detection. It is constructed based on the existing publicly available dataset, LEVIR-CD. The LEVIR-CD + dataset consists of a total of 1,970 samples, with 985 samples designated for training and the remaining 985 samples reserved for testing.

    Each sample in the dataset includes a pair of remote sensing images: the pre-event image and the post-event image, along with their corresponding building change label map. The remote sensing images in this dataset have a size of 1024 × 1024 pixels, with a pixel resolution of 0.5 meters.

    The label map is of the same size as the images, with pixel values of either 0 or 255 (Figure 5). A pixel value of 0 indicates the absence of building changes, while a value of 255 represents the presence of building changes. The dataset contains approximately 80,000 instances of changed buildings.

  2. DSIFN dataset

Figure 5. The examples of multi-temporal images from the LEVIR-CD + dataset.

The DSIFN dataset is manually collected from Google Earth. It consists of six large-scale pairs of high-resolution images captured at different time points, covering six cities in China, namely Beijing, Chengdu, Shenzhen, Chongqing, Wuhan, and Xi’an. Five major image pairs (i.e. Beijing, Chengdu, Shenzhen, Chongqing, and Wuhan) are cropped into 394 sub-image pairs, each with a size of 512 × 512 pixels. Through data augmentation, a collection of 3,940 pairs of dual-time images was obtained.

For model testing purposes, the Xi’an image pairs are further cropped into 48 sub-image pairs (Figure 6). The training dataset consists of 3,600 image pairs, the validation dataset contains 340 image pairs, and the test dataset comprises 48 image pairs.

Figure 6. The examples of multi-temporal images from the DSIFN dataset.

3.2. Experimental environment

All experiments in this study are conducted on the Ubuntu 19.10 operating system. The PyTorch deep learning framework is employed for model development and training. The hardware configuration consists of an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of dedicated memory. The software configuration includes Python 3.9.7 and PyTorch 1.10.2, with CUDA version 10.2 for GPU acceleration.

3.3. Setting of training parameters

For the configuration of training parameters, the MMSegmentation framework is employed. Cross-Entropy is used as the loss function and AdamW as the optimizer, with a learning rate of 0.0006 and a weight decay of 0.01. The training schedule employs the poly learning rate decay strategy.
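In plain PyTorch (outside the MMSegmentation config system), these settings correspond roughly to the following sketch; the total iteration count and the poly power are assumptions, as they are not reported here.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for DDCDNet; only the optimizer settings matter here.
model = nn.Conv2d(3, 2, kernel_size=3, padding=1)

max_iters, power = 40000, 0.9  # assumed values for the poly schedule (not reported in the paper)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.01)
# Poly learning-rate decay: lr = base_lr * (1 - iter / max_iters) ** power
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iters) ** power)
```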

3.4. Data augmentation and evaluation metrics

To mitigate overfitting and enhance the training model’s performance, data augmentation techniques are employed on the original data. Random rotation, Gaussian blur, and random scale cropping are utilized as means of data augmentation. Subsequently, the dual-temporal remote sensing images are fed as inputs to the dual encoders to extract deep-level features. These features are then fused through a feature fusion module and decoded by the decoder to obtain the final results of change detection in this study.
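As a sketch of how such paired augmentation can be applied so that both temporal images and the label receive identical geometric transforms (a simplified stand-in, not the exact pipeline used in this study; Gaussian blur and other photometric transforms would be applied to the images only):

```python
import random
import torch
import torch.nn.functional as F

def augment_pair(img_pre, img_post, label, out_size=256, scale_range=(0.75, 1.0)):
    """Apply identical random rotation, flips, and random-scale crop to a pair of
    (C, H, W) image tensors and a (1, H, W) label tensor."""
    # 1) Random rotation by a multiple of 90 degrees.
    k = random.randint(0, 3)
    img_pre, img_post, label = (torch.rot90(t, k, dims=(-2, -1)) for t in (img_pre, img_post, label))

    # 2) Random horizontal / vertical flips.
    for dim in (-1, -2):
        if random.random() < 0.5:
            img_pre, img_post, label = (torch.flip(t, dims=(dim,)) for t in (img_pre, img_post, label))

    # 3) Random-scale crop: take a random square patch and resize it to out_size.
    h, w = img_pre.shape[-2:]
    crop = int(min(h, w) * random.uniform(*scale_range))
    top, left = random.randint(0, h - crop), random.randint(0, w - crop)
    img_pre, img_post, label = (t[..., top:top + crop, left:left + crop] for t in (img_pre, img_post, label))
    img_pre, img_post = (
        F.interpolate(t.unsqueeze(0), size=out_size, mode="bilinear", align_corners=False).squeeze(0)
        for t in (img_pre, img_post))
    label = F.interpolate(label.unsqueeze(0).float(), size=out_size, mode="nearest").squeeze(0).long()
    return img_pre, img_post, label
```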

To assess the performance of the model, this study employs F1-score, Recall, Intersection over Union (IoU), and Precision as evaluation metrics. The definitions of these metrics are as follows:

(3) $\mathrm{Precision} = \dfrac{TP}{TP + FP}$

(4) $\mathrm{Recall} = \dfrac{TP}{TP + FN}$

(5) $\mathrm{IoU} = \dfrac{TP}{TP + FP + FN}$

(6) $F1\text{-}score = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

In the equations, TP (True Positive) represents the number of correctly detected changed pixels. FP (False Positive) represents the number of non-changed pixels that were incorrectly detected as changed. FN (False Negative) represents the number of changed pixels that were not detected. TN (True Negative) represents the number of correctly detected non-changed pixels. For the task of change detection, a higher Precision value indicates a lower number of false positives, while a higher Recall value indicates a lower number of false negatives. The values of F1-score and IoU reflect the overall performance and robustness of the model. These two metrics provide insights into the model’s ability to accurately detect changes and achieve spatial agreement with the ground truth.
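These metrics can be computed directly from binary change maps, for example as in the following sketch; for the LEVIR-CD + labels, the ground-truth mask can be obtained as gt = (label == 255) before calling the function.

```python
import numpy as np

def change_detection_metrics(pred, gt):
    """Compute Precision, Recall, IoU, and F1 (Eqs. 3-6) for binary change maps,
    where 1 marks changed pixels and 0 unchanged pixels."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    eps = 1e-10  # avoids division by zero on empty masks
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return dict(precision=precision, recall=recall, iou=iou, f1=f1)
```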

3.5. Analysis of experimental results

The proposed method is applied to two datasets, namely LEVIR-CD + and DSIFN, for experimental evaluation. To validate the superiority of the proposed method, comparative experiments are conducted against several traditional change detection methods for remote sensing images (FC-EF, FC-Siam-Diff, and FC-Siam-Conc), as well as deep learning-based change detection methods (SNUNet, ChangerEx (Fang et al. Citation2023), ChangeFormer (Bandara and Patel Citation2022), and BIT (Chen et al. Citation2022)). The experimental results are presented in Tables 1 and 2, with the bold numbers indicating the best values.

Table 1. Performance with the LEVIR-CD + dataset.

Table 2. Performance with the DSIFN dataset.

The results demonstrate that the proposed method performs well on both datasets. Compared to the state-of-the-art change detection algorithm ChangerEx, the proposed method achieves an IoU improvement of 1.11% and 17.13% on the two datasets, respectively. In particular, it exhibits significant advantages on the DSIFN dataset, with an F1-score increase of 12.38% and a Recall improvement of 1.36% compared to ChangerEx. Moreover, the proposed method demonstrates more stable performance across different datasets, whereas ChangerEx exhibits notable performance variations between the LEVIR-CD + and the DSIFN datasets. Notably, the proposed algorithm utilizes a lightweight backbone network, highlighting its efficiency and effectiveness.

To visually demonstrate the advantages of the proposed method, the test images from both datasets are subjected to a visualization process. The experimental results are shown in Figures 7 and 8, where a randomly selected set of test results from the LEVIR-CD + and DSIFN datasets, respectively, is visually presented. It can be observed that, compared to the state-of-the-art ChangerEx method, the proposed method exhibits lower false positive and false negative rates, and the detected change areas are more closely aligned with the ground truth, showcasing its superior performance.

Figure 7. Visualization of the change detection results for the LEVIR-CD + dataset. Different colors are used for a better view, i.e. black for true negative, green for true positive, red for false negative, and yellow for false positive.

Figure 8. Visualization of the change detection results for the DSIFN dataset. Different colors are used for a better view, i.e. black for true negative, green for true positive, red for false negative, and yellow for false positive.

3.6. Ablation experiment

To assess the effectiveness of each module in DDCDNet, ablation experiments were conducted on the LEVIR-CD + dataset, utilizing the Swin Transformer as the base model. The resulting experimental outcomes are detailed in Table 3. It is evident from the table that each module within the proposed DDCDNet contributes to performance. While the introduction of the difference1 module attains the highest precision, the other metrics show suboptimal results. However, upon integrating the difference2 and MLFF (Multi-Level Feature Fusion) modules, the overall metrics demonstrate a balanced and optimal performance, thereby confirming their effectiveness. Remarkably, DDCDNet achieves peak values for F1-score, recall, and IoU.

Table 3. Ablation studies of different modules on the LEVIR-CD + dataset. All scores are expressed as percentages (%). The best scores are marked in bold.

4. Discussion

In this section, the advantages and future directions for improvement of the proposed method are discussed. The method proposed in this paper is compared with several advanced deep learning-based change detection methods on the LEVIR-CD + and DSIFN building change datasets. The comparison indicates that the proposed method can achieve better performance than other methods, particularly in scenarios with a limited number of small change detection samples. This improvement is partially attributed to the sample extraction and data augmentation strategy. Distinct from other CNN-based backbone networks, the proposed network in this study employs the Swin Transformer, based on the Transformer architecture, as the backbone network for extracting features from the input dual-temporal building remote sensing images. This choice allows for a more comprehensive focus on the differences between ground features at different times. The adopted dual-difference module can better extract low-level and high-level features, effectively improving the model’s ability to capture both local and global features in the input images and reducing false positives and false negatives.

In order to further validate the performance of DDCDNet, experiments were conducted on the LEVIR-CD dataset, comparing it with the latest feature fusion-based building change detection methods, RSCDNet and BCDetNet. The experimental results are presented in Table 4. Despite a slightly lower F1 score compared to BCDetNet, DDCDNet still demonstrates outstanding performance.

Table 4. The comparative experimental results of DDCDNet with the latest state-of-the-art (SOTA) methods on the LEVIR-CD dataset.

It must be acknowledged that DDCDNet is limited by its reliance on a large amount of manual pixel-by-pixel annotation as training data, which increases both labor and time costs. Through comparison, it was also found that DDCDNet suffers from an excessive parameter count, which is primarily attributed to the incorporation of numerous fusion modules. Due to the low spectral resolution of the input images, the limited number of channels (bands), the similarity between foreground and background colors of buildings, and occlusion caused by trees and other factors in high-resolution remote sensing images, there are cases of missed detections. Future work will focus on the following aspects:

  1. Constructing weakly supervised or even unsupervised network models to reduce the time and labor costs associated with manual annotation.

  2. Incorporating unmanned aerial vehicle (UAV) point cloud data to eliminate occluded regions of buildings.

  3. Conducting further research on buildings with similar foreground and background colors to improve detection accuracy.

5. Conclusions

To enhance the accuracy of building change detection from remote sensing images, this study proposes different feature extraction modules and feature fusion modules for feature maps at different hierarchical levels. The difference1 module allows for better handling of objects at multiple scales, enabling improved capture of local features in the images. The difference2 module is employed to better exploit the extracted global features. Finally, the MLFF module is employed to fuse the extracted features. The proposed dual-difference module in this study achieves an F1-score improvement of 7.25% and 5.75% over ChangeFormer on the LEVIR-CD + and DSIFN datasets, respectively. In the future, each module of DDCDNet will be applied to more datasets, and existing modules will be adjusted or new modules will be added to adapt to multi-class change detection tasks, further improving the performance of change detection in remote sensing images. Improving the difference module and reducing the model size will also be considered in future research. Additionally, the construction of weakly supervised models will be explored to further mitigate the labor and time costs associated with manual annotations.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The datasets supporting this study’s findings are available online. The LEVIR-CD + dataset is available for download at https://justchenhao.github.io/LEVIR. The DSIFN dataset is available for download at https://github.com/GeoZcx/A-deeply-supervised-image-fusion-network-for-change-detection-in-remote-sensing-images/tree/master/dataset.

Additional information

Funding

Science and Technology Project of Qinghai Province (No. 2023-QY-208).

References

  • Bandara WG, Patel VM. 2022. A transformer-based siamese network for change detection. IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium. p. 207–210. doi:10.1109/IGARSS46834.2022.9883686.
  • Barkur R, Suresh D, Lal S, Reddy, CS, Diwakar, PG. 2023. RSCDNet: a robust deep learning architecture for change detection from bi-temporal high resolution remote sensing images. IEEE Trans Emerg Top Comput Intell. 7(2), 537–551. doi:10.1109/TETCI.2022.3230941.
  • Basavaraju KS, Hiren NS, Sravya N, Lal S, Nalini J, Sudhakar Reddy C. 2023. BCDetNet: a deep learning architecture for building change detection from bi-temporal high resolution satellite images. Int J Mach Learn Cyber. 14(12):4047–4062. doi:10.1007/s13042-023-01880-z.
  • Basavaraju KS, Sravya N, Lal S, Nalini J, Reddy CS, Dell’Acqua F. 2022. UCDNet: a deep learning model for urban change detection from bi-temporal multispectral Sentinel-2 satellite images. IEEE Trans Geosci Remote Sens. 60:1–10. doi:10.1109/TGRS.2022.3161337.
  • Chen J, Fan J, Zhang M, Zhou Y, Shen C. 2022. MSF-Net: a multiscale supervised fusion network for building change detection in high-resolution remote sensing images. IEEE Access. 10:30925–30938. doi:10.1109/ACCESS.2022.3160163.
  • Chen H, Qi Z, Shi Z. 2022. Remote sensing image change detection with transformers. IEEE Trans Geosci Remote Sens. 60:1–14. doi:10.1109/TGRS.2021.3095166.
  • Chen H, Shi Z. 2020. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote. Sens. 12(10):1662. doi:10.3390/rs12101662.
  • Chen D, Wang Y, Shen Z, Liao J, Chen J, Sun S. 2021. Long time-series mapping and change detection of coastal zone land use based on google earth engine and multi-source data fusion. Remote Sens. 14(1):1. doi:10.3390/rs14010001.
  • Chen Z, Zhou Y, Wang B, Xu X, He N, Jin S, Jin S. 2022. EGDE-Net: a building change detection method for high-resolution remote sensing imagery based on edge guidance and differential enhancement. ISPRS J Photogramm Remote Sens. 191:203–222. doi:10.1016/j.isprsjprs.2022.07.016.
  • Chen L, Zhu Y, Papandreou G, Schroff F, Adam H. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. European Conference on Computer Vision (ECCV). p. 801–818. doi:10.1007/978-3-030-01234-2_49.
  • Cho K, Merrienboer BV, Gülçehre Ç, Bahdanau D, Bougares F, Schwenk H, Bengio Y. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. Conference on empirical methods in natural language processing. doi:10.3115/v1/D14-1179.
  • Daudt RC, Le Saux B, Boulch A. 2018. Fully convolutional siamese networks for change detection. In 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE. p. 4063–4067. doi:10.1109/ICIP.2018.8451652.
  • Du ZH, Li X, Miao J, Huang Y, Shen H, Zhang L. 2024. Concatenated deep-learning framework for multitask change detection of optical and sar images. IEEE J Sel Top Appl Earth Observ Remote Sens. 17:719–731. doi:10.1109/JSTARS.2023.3333959.
  • Eftekhari A, Samadzadegan F, Dadras Javan F. 2023. Building change detection using the parallel spatial-channel attention block and edge-guided deep network. Int J Appl Earth Obs Geoinf. 117:103180. doi:10.1016/j.jag.2023.103180.
  • Fang S, Li K, Li Z. 2023. Changer: feature interaction is what you need for change detection. IEEE Trans Geosci Remote Sens. 61:1–11. doi:10.1109/TGRS.2023.3277496.
  • Fang S, Li K, Shao J, Li Z. 2022. SNUNet-CD: a densely connected siamese network for change detection of VHR images. IEEE Geosci Remote Sens Lett. 19(1):1–5. doi:10.1109/LGRS.2021.3056416.
  • Feizizadeh B, Blaschke T, Tiede D, Moghaddam MH. 2017. Evaluating fuzzy operators of an object-based image analysis for detecting landslides and their changes. Geomorphology. 293:240–254. doi:10.1016/j.geomorph.2017.06.002.
  • Guo E, Fu X, Zhu J, Deng M, Liu Y, Zhu Q, Li H. 2018. Learning to measure change: fully convolutional siamese metric networks for scene change detection. IEEE Trans Multimedia. 18:1–10. ArXiv, abs/1810.09111.
  • Hölbling D, Friedl B, Eisank C. 2015. An object-based approach for semi-automated landslide change detection and attribution of changes to landslide classes in northern Taiwan. Earth Sci Inform. 8(2):327–335. doi:10.1007/s12145-015-0217-3.
  • Huang G, Liu Z, Weinberger KQ. 2016. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). p. 2261–2269. doi:10.1109/CVPR.2017.243.
  • Kattenborn T, Leitloff J, Schiefer F, Hinz S. 2021. Review on convolutional neural networks (CNN) in vegetation remote sensing. ISPRS J Photogramm Remote Sens. 173:24–49. doi:10.1016/j.isprsjprs.2020.12.010.
  • Khotimah WN, Boussaid F, Sohel F, Xu L, Edwards D, Jin X, Bennamoun M. 2022. SC-CAN: spectral convolution and channel attention network for wheat stress classification. Remote. Sens. 14(17):4288. doi:10.3390/rs14174288.
  • Li Z, Shen H, Cheng Q, Liu Y, You S, He Z. 2019. Deep learning based cloud detection for medium and high resolution remote sensing images of different sensors. ISPRS J Photogramm Remote Sens. 150:197–212. doi:10.1016/j.isprsjprs.2019.02.017.
  • Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. 2021. Swin transformer: hierarchical vision transformer using shifted windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). p. 9992–10002. doi:10.1109/ICCV48922.2021.00986.
  • Li Y, Wang C. 2022. Image splicing tamper detection based on two-channel dilated convolution. 2022 3rd Asia Service Sciences and Software Engineering Conference. Association for Computing Machinery, New York, NY, USA. p. 37–43. doi:10.1145/3523181.3523187.
  • Morgan GR, Wang C, Li Z, Schill SR, Morgan DR. 2022. Deep learning of high-resolution aerial imagery for coastal marsh change detection: a comparative study. IJGI. 11(2):100. doi:10.3390/ijgi11020100.
  • Peng X, Zhong R, Li Z, Li Q. 2021. Optical remote sensing image change detection based on attention mechanism and image difference. IEEE Trans Geosci Remote Sens. 59(9):7296–7307. doi:10.1109/TGRS.2020.3033009.
  • Shi W, Zhang M, Zhang R, Chen S, Zhan Z. 2020. Change detection based on artificial intelligence: state-of-the-art and challenges. Remote Sens. 12(10):1688. doi:10.3390/rs12101688.
  • Song C, Huang B, Ke L, Richards KS. 2014. Remote sensing of alpine lake water environment changes on the Tibetan Plateau and surroundings: a review. ISPRS J Photogramm Remote Sens. 92:26–37. doi:10.1016/j.isprsjprs.2014.03.001.
  • Strudel R, Garcia R, Laptev I, Schmid C. 2021. Segmenter: transformer for semantic segmentation. arXiv preprint arXiv:2105.05633.
  • Wen D, Huang X, Bovolo F, Li J, Ke X, Zhang A, Benediktsson JA. 2021. Change detection from very-high-spatial-resolution optical remote sensing images: methods, applications, and future directions. IEEE Geosci Remote Sens Mag. 9(4):68–101. doi:10.1109/MGRS.2021.3063465.
  • Woo S, Park J, Lee JY, Kweon IS. 2018. CBAM: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV). doi:10.1007/978-3-030-01234-2_1.
  • Wu X, Hong D, Chanussot J. 2022. Convolutional neural networks for multimodal remote sensing data classification. IEEE Trans Geosci Remote Sens. 60:1–10. doi:10.1109/TGRS.2021.3124913.
  • Xu X, Zhou Y, Lu X, Chen Z. 2023. FERA-net: a building change detection method for high-resolution remote sensing imagery based on residual attention and high-frequency features. Remote Sens. 15(2):395. doi:10.3390/rs15020395.
  • Yang Q, Lu T, Zhou H. 2022. A spatio-temporal motion network for action recognition based on spatial attention. Entropy. 24(3):368. doi:10.3390/e24030368.
  • Yang X, Qu D, Si N, Xin S. 2023. Remote sensing image object detection algorithm combining attention. Proceedings of the 2023 9th International Conference on Computing and Artificial Intelligence. Association for Computing Machinery, New York, NY, USA. p. 1–7. doi:10.1145/3594315.3594316.
  • Zhang C, Yue P, Tapete D, Jiang L, Shangguan B, Huang L, Liu G. 2020. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J Photogramm Remote Sens. 166:183–200. doi:10.1016/j.isprsjprs.2020.06.003.
  • Zheng H, Gong M, Liu T, Jiang F, Zhan T, Lu D, Zhang M. 2022. HFA-Net: high frequency attention siamese network for building change detection in VHR remote sensing images. Pattern Recognit. 129:108717. doi:10.1016/j.patcog.2022.108717.
  • Zhou Y, Feng Y, Huo S, Li X. 2022. Joint frequency-spatial domain network for remote sensing optical image change detection. IEEE Trans Geosci Remote Sens. 60:1–14. doi:10.1109/TGRS.2022.3196040.
  • Zhou Z, Siddiquee MM, Tajbakhsh N, Liang J. 2019. UNet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans Med Imaging. 39(6):1856–1867. doi:10.1109/TMI.2019.2959609.