Research Article

Extracting urban impervious surface based on optical and SAR images cross-modal multi-scale features fusion network

Article: 2301675 | Received 06 Sep 2023, Accepted 31 Dec 2023, Published online: 10 Jan 2024

ABSTRACT

Monitoring the spatiotemporal distribution of urban impervious surface is an essential indicator for measuring the urbanization process. Optical and synthetic aperture radar (SAR) images are key data sources for urban impervious surface extraction. Because cities are highly heterogeneous scenes, extracting urban impervious surface from a single data source encounters an accuracy bottleneck imposed by the limits of single-modal feature representation, a bottleneck that fusing the two data sources can help overcome. However, existing studies have mostly fused the two directly by layer stacking, without taking the modal differences between optical and SAR (optical-SAR) images into account, and thus cannot fully realize the complementarity between them. Therefore, this study proposes a cross-modal multi-scale features fusion segmentation network (CMFFNet) for optical-SAR images for urban impervious surface extraction. A cross-modal features fusion (CMFF) module is designed in the proposed CMFFNet to fully exploit the complementary information of optical-SAR images. Additionally, we propose a multi-scale features fusion (MSFF) module to fuse multi-scale features of optical-SAR images, taking into account the multi-scale characteristics of urban impervious surface. Experimental results demonstrate that the proposed CMFFNet outperforms current mainstream methods for extracting impervious surface.

1. Introduction

Urbanization across the globe accelerated quickly in the 1990s as a result of the wave of economic globalization. As of 2018, 55% of the world’s population lived in cities (Hoole, Hincks, and Rae 2019). China’s urbanization rate increased from 17.9% in 1978 to 56.1% in 2015 following the full implementation of the reform and opening-up policy in 1978 (Kuang 2020). China’s ‘14th Five Year Plan’ projects that by 2035 the country’s urbanization rate will have risen to almost 65%.

To meet the needs of urban residents’ daily lives and economic development, rapid urbanization inevitably results in the continuous replacement of natural surfaces such as farmland, vegetation, and water bodies by artificial surfaces, namely impervious surfaces. Impervious surface refers to surface covered by impermeable materials, usually including surfaces with low permeability such as roofs, parking lots, and roads (Weng 2012). The ongoing increase of impervious surface alters the original state of the urban ecosystem and leads to a decline in the ecological processes that support urban life. For instance, impervious surfaces significantly reduce surface evapotranspiration, aggravating the urban heat island effect (Fu and Weng 2016). As another example, a high ratio of impervious surface can produce four to five times more urban surface runoff during a downpour, raising the risk of waterlogging in urban areas. Therefore, accurate monitoring of the spatiotemporal distribution of impervious surface is crucial.

As Earth observation technology has advanced, a substantial amount of observational data has been gathered, particularly open-source data such as MODIS, Landsat, and Sentinel, making remote sensing a widely used means of mapping the Earth’s impervious surface. Owing to the easy accessibility and high interpretability of optical remote sensing data, researchers have developed four main categories of impervious surface extraction techniques using such data: spectral mixture analysis, spectral index methods, image classification methods, and multi-source data fusion methods (Lu et al. 2014; Wang and Li 2019). The spectral mixture analysis approach takes into consideration that each pixel is composed of information from multiple ground objects and primarily targets the mixed-pixel problem in medium- and low-resolution remote sensing images. When spectral mixture analysis is used to extract impervious surface, the image is spectrally unmixed to obtain the abundance of each ground object pixel by pixel, and the percentage of impervious surface per pixel is then obtained by summing the abundances of all surface subcategories that belong to impervious surface (Wu 2004; Fan and Deng 2014). The choice of end-members, including their number and kind, is crucial to the effective use of spectral mixture analysis and can be difficult in intricate urban landscapes. To cope with increasingly high spatial resolution remote sensing images, some scholars regard the extraction of urban impervious surface as a land cover classification task and apply classification methods to extract impervious surface (Lu and Weng 2009; Hu and Weng 2011; Shao et al. 2019). This type of impervious surface extraction technique is the most developed and established, with studies ranging from local regions to large-scale regions such as countries and the globe (Gong et al. 2020; Zhang et al. 2019; Huang et al. 2022; Sun et al. 2022). To extract impervious surface quickly over large areas, some scholars, inspired by the Normalized Difference Vegetation Index (NDVI), have constructed a series of impervious surface indexes, such as the Normalized Difference Impervious Surface Index (NDISI) (Xu 2010), the Normalized Difference Impervious Index (NDII) (Wang et al. 2015), the Biophysical Composition Index (BCI) (Deng and Wu 2012), and the Built-up Areas Saliency Index (BASI) (Shao, Tian, and Shen 2014), of which NDISI is the first true impervious surface index. Although the aforementioned methods for extracting urban impervious surface from optical images have achieved considerable accuracy, cities are highly heterogeneous scenes, and spectral confusion between ground objects, such as bare soil and bright impervious surface, has been a major factor hindering accurate extraction of urban impervious surface from optical images. In addition, the efficient collection of optical data can be hampered by meteorological conditions such as clouds and rain.

To alleviate the aforementioned issues, scholars have begun to exploit the ability of synthetic aperture radar (SAR) to penetrate clouds and fog and to extract urban impervious surface by fusing SAR with optical images. Optical and SAR (optical-SAR) images can be fused at the pixel level, feature level, and decision level. Owing to its simplicity of operation and high tolerance for geometric registration errors, feature-level fusion has become the most widely used fusion approach for extracting urban impervious surface (Wu et al. 2022). For example, Guo et al. improved extraction accuracy relative to a single data source by fusing the spectral properties of SPOT images with the polarization information of fully polarimetric Radarsat-2 images and feeding them into a decision tree model to extract impervious surface (Guo et al. 2014). Zhang et al. extracted and stacked texture features from optical-SAR images and input them into a random forest for impervious surface extraction (Zhang, Lin, and Li 2015; Lin et al. 2020). By combining the polarization characteristics of Sentinel-1 and the multispectral information of Sentinel-2, Sun et al. proposed a step-by-step extraction framework for extracting impervious surface in shadowed areas, minimizing the effect of shadows on the extraction of impervious surface (Sun et al. 2022). Additionally, Zhang et al. investigated the effects of feature normalization and fusion level on impervious surface extraction based on the fusion of optical-SAR images (Zhang, Zhang, and Lin 2012; Zhang, Lin, and Li 2015; Zhang and Xu 2018). Following the successful use of deep learning in the image domain, some researchers have investigated deep learning for remote sensing image fusion, tailored to the characteristics of remote sensing data. For example, Zhang et al. proposed a convolutional neural network with an attention mechanism for fusing features of hyperspectral and Light Detection and Ranging (LiDAR) data for land cover classification, fully utilizing the elevation information in LiDAR data and the spatial-spectral information in hyperspectral data and thereby providing a new solution for multimodal features fusion (Zhang et al. 2023). As another example, Hong et al. and Yao et al. used the emerging Generative Adversarial Network (GAN) and Transformer models to fuse multimodal remote sensing images for land cover classification, which better bridged the semantic gap between multimodal data and significantly improved the accuracy of land cover and land use classification (Hong et al. 2021; Yao et al. 2023). Recently, deep learning has also begun to be applied to impervious surface extraction from fused optical-SAR images. Zhang et al. proposed a small-patch-based deep convolutional network that predicts impervious surface by automatically extracting features from optical-SAR images (Zhang et al. 2019). Li et al. proposed MCANet, a network that cross-fuses features from optical-SAR images for land cover classification (Li et al. 2022).

The majority of current studies on the fusion of optical-SAR image features for impervious surface extraction use a straightforward layer stacking approach that ignores the modal differences between optical-SAR image features. For heterogeneous data sources like optical-SAR, it is essential to fuse the most useful features and suppress irrelevant ones during the features fusion process. Therefore, this study proposes a cross-modal multi-scale features fusion segmentation network (CMFFNet) using optical-SAR images for urban impervious surface extraction. In the proposed CMFFNet, we design a cross-modal features fusion (CMFF) module to fuse optical-SAR image features and a multi-scale features fusion (MSFF) module to fuse the multi-scale features. Experimental results show that the proposed CMFFNet is more effective at extracting impervious surface than other mainstream models.

2. Study area and dataset

Accurate urban impervious surface datasets are required to support deep learning-based research on impervious surface extraction. Therefore, this study uses Sentinel-2 and Sentinel-1 images covering the same ground scene as data sources to construct a multimodal impervious surface dataset, with Sentinel-2 as the optical data source and Sentinel-1 as the SAR data source. This section covers the construction of the dataset, including the study area, data preprocessing, and dataset annotation.

2.1. Study area

As shown in Figure 1, the study area is located in Wuhan City, Hubei Province, China, a typical urban scene whose land cover types mainly include buildings, roads, water bodies, vegetation, and bare soil. Wuhan, a key city in central China, is a significant industrial, scientific, and educational hub that contributes substantially to the country’s economic and social development. By 2021, Wuhan’s urbanization rate had increased to 84.56%, nearly double the 47.4% recorded in 1978, when the reform and opening-up policy began. The impervious surface area of Wuhan has grown markedly as a result of the city’s rapid urbanization, increasing roughly fivefold over the past 30 years to 1167.63 km² in 2017. The expansion of impervious surface has altered the original state of the urban ecology and decreased the city’s ability to withstand ecological threats. For instance, Wuhan experienced two major floods, in 1998 and 2016, causing significant damage to property and people’s lives. Accurately monitoring the growth of urban impervious surface and rationally planning urban construction are therefore crucial steps toward preventing a repeat of such disasters. However, Wuhan is located at 113°41′E-115°05′E and 29°58′N-31°22′N and has a northern subtropical monsoon (humid) climate, receiving abundant annual precipitation of 1150 to 1450 mm. The frequently cloudy and rainy weather makes it impossible to efficiently monitor impervious surface using optical data alone, so it is critical to develop urban impervious surface extraction techniques based on the fusion of optical-SAR data for regions like Wuhan. As a result, this study takes Wuhan as the study area.

Figure 1. The geographical location of the study area and some sample data from the dataset used in this study.


2.2. Dataset

2.2.1. Data preprocessing

For this study, we chose Sentinel-2 and Sentinel-1 images acquired over the same region as the data sources. These images cover an area of more than 3000 km², roughly one-third of the city of Wuhan, including the central urban area and surrounding suburbs. Sentinel-1 and Sentinel-2 are two satellite missions of the European Space Agency’s (ESA) Copernicus program, and their data are publicly accessible and shared with users worldwide via the ESA’s official website. The Sentinel-2 image used includes four bands (red, green, blue, and near infrared) with a spatial resolution of 10 m. The Sentinel-1 image used is a dual-polarization (VV and VH) ground range detected (GRD) product acquired in the C-band with a spatial resolution of 10 m. The data acquisition dates fall between June 1, 2019 and June 30, 2019. Table 1 provides specific information on the data used in this study.

Table 1. Details of the data used in this study.

Figure 2 depicts the main preprocessing steps for the acquired images. The essential preprocessing steps for Sentinel-2 data mainly involve atmospheric correction and terrain correction, whereas those for Sentinel-1 data include speckle filtering, terrain correction, and other steps. This study uses Google Earth Engine (GEE) for data gathering and the related preprocessing owing to its rapid access to satellite data and robust computing capabilities. Additionally, the final data used in this study are obtained by averaging all data collected between June 1, 2019 and June 30, 2019; for Sentinel-1 images in particular, this procedure helps reduce the interference of random noise. After obtaining the preprocessed Sentinel-1 and Sentinel-2 images, we manually choose ground control points (GCPs) based on visual interpretation to register the two images, and we then iteratively adjust the positions of the corresponding GCPs until the geometric registration error between the Sentinel-1 and Sentinel-2 images is less than one pixel. GCPs generally have characteristics that are easy to interpret, such as building corners and road intersections. The Sentinel-1 and Sentinel-2 images are then used to create the impervious surface dataset and to extract impervious surface based on the fusion of optical-SAR images.
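The GEE-based compositing described above can be sketched with the GEE Python API as follows. This is a minimal, illustrative sketch rather than the authors' actual script: the collection and band identifiers follow the public GEE data catalog, the rectangle approximates the extent of Wuhan given in Section 2.1, and the speckle filtering, registration, and export steps discussed in the text are omitted.

```python
import ee

ee.Initialize()  # assumes an authenticated GEE account

# Approximate extent of Wuhan (113°41'E-115°05'E, 29°58'N-31°22'N)
wuhan_aoi = ee.Geometry.Rectangle([113.68, 29.97, 115.08, 31.37])

# Sentinel-2 surface reflectance: blue, green, red, near-infrared bands at 10 m,
# averaged over June 2019
s2 = (ee.ImageCollection('COPERNICUS/S2_SR')
      .filterBounds(wuhan_aoi)
      .filterDate('2019-06-01', '2019-06-30')
      .select(['B2', 'B3', 'B4', 'B8'])
      .mean())

# Sentinel-1 GRD, IW mode, dual polarization (VV + VH); the monthly mean
# composite also suppresses random speckle to some degree
s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
      .filterBounds(wuhan_aoi)
      .filterDate('2019-06-01', '2019-06-30')
      .filter(ee.Filter.eq('instrumentMode', 'IW'))
      .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VV'))
      .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VH'))
      .select(['VV', 'VH'])
      .mean())
```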

Figure 2. Schematic diagram for data preprocessing and data annotations.


2.2.2. Data annotation

Data annotation is the most tedious and crucial stage in the creation of a dataset. In this study, the data are annotated manually using the professional image processing software Photoshop. Because the SAR images are severely contaminated by speckle, we visually interpret the optical image to determine whether each pixel belongs to impervious surface or pervious surface. Once the land cover type has been determined, we use the paint bucket tool in Photoshop to mark the pixel category, with impervious surface marked as 1 and pervious surface marked as 0. Furthermore, higher spatial resolution Google Earth images are used as an additional means of interpretation during annotation, owing to the limited spatial resolution of the Sentinel-2 images. Once every pixel has been annotated, the ground truth for the entire impervious surface can be obtained. This study also corrects the impervious surface ground truth by comparing it against the optical images for verification, which helps minimize errors and omissions in annotation. Because the optical-SAR images have been registered, the impervious surface ground truth annotated on the optical image is also aligned with the SAR image. After obtaining the entire ground truth, and in order to adapt it to the input of deep learning networks, this study divides the entire optical image, SAR image, and the corresponding ground truth into image blocks of 128 × 128 pixels with a stride of 80 pixels, resulting in a total of 5603 image blocks. The training, test, and validation sets follow an 8:1:1 ratio.
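As an illustration of the tiling scheme just described (128 × 128 patches cut with a stride of 80 pixels, followed by an 8:1:1 split), a minimal Python sketch is given below. The array shapes and function names are our own; reading the co-registered rasters and writing the patches to disk are omitted, and border handling beyond the last full window is not shown.

```python
import numpy as np

def tile(optical, sar, label, size=128, stride=80):
    """optical: (H, W, 4), sar: (H, W, 2), label: (H, W); returns a list of patch triples."""
    patches = []
    h, w = label.shape
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            patches.append((optical[top:top + size, left:left + size],
                            sar[top:top + size, left:left + size],
                            label[top:top + size, left:left + size]))
    return patches

def split(patches, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the patches and split them into train/test/validation subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(patches))
    n_train = int(ratios[0] * len(patches))
    n_test = int(ratios[1] * len(patches))
    train = [patches[i] for i in idx[:n_train]]
    test = [patches[i] for i in idx[n_train:n_train + n_test]]
    val = [patches[i] for i in idx[n_train + n_test:]]
    return train, test, val
```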

3. Methodology

The CMFFNet proposed in this study mainly consists of a CMFF module designed to fuse the features of optical-SAR images and an MSFF module designed to fuse multi-scale features for urban impervious surface extraction. This section gives an overview of the architecture of the proposed CMFFNet model, the CMFF module, and the MSFF module.

3.1. CMFFNet structure

In this study, urban impervious surface extraction is treated as an image semantic segmentation task with two categories: impervious surface and pervious surface. As seen in Figure 3, the architecture of the proposed model is primarily composed of three modules: the features extraction module, the CMFF module, and the MSFF module.

Figure 3. Overall structure of the proposed CMFFNet.


First, a batch of original optical-SAR images and the corresponding impervious surface ground truth are input into the network, and multi-scale features are extracted through the features extraction module with a series of convolution operations. Through forward propagation, the model learns the useful features in the optical-SAR images of the training dataset, and these features correspond to the ground truth pixel by pixel. By comparing the model’s output with the corresponding ground truth, the loss of the model during forward propagation can be quantified. The network is then optimized by minimizing this loss through the back propagation algorithm.

Next, the features extracted separately from the optical and SAR images are input into the CMFF module for features fusion, which fully exploits the advantages of optical-SAR images, improves the complementarity of the optical-SAR information, and enhances the model’s ability to extract urban impervious surface.

Then, the cross-modal fused features are input into the MSFF module. After upsampling with different scale factors, the fused features at various scales share the same size, and these same-sized multi-scale features are fused through the MSFF module, enabling the model to integrate both low-resolution and high-resolution image features for impervious surface prediction.

Finally, based on the output feature maps of the MSFF module, the final impervious surface prediction result is obtained through convolution and 4× upsampling operations.

3.2. Features extraction module

The features extraction module used in this study is a pseudo-Siamese network composed of two structurally identical feature extraction branches that extract features from the optical and SAR images, respectively, with no weight sharing between the two branches. This design accounts for the different imaging mechanisms of optical-SAR images. Convolutional neural networks (CNNs), of which VGG (Simonyan and Zisserman 2014) and ResNet (He et al. 2016) are classic examples, learn and extract relevant object features by stacking numerous convolution and pooling layers.

This study adopts ResNet-18 as the backbone network for feature extraction; its detailed structure is shown in Table 2. ResNet-18 includes 5 blocks (Block-1 through Block-5 in Table 2) that extract image features from shallow to deep levels. In the ResNet-18 model, a 7 × 7 convolution and a 3 × 3 max-pooling operation, namely Block-1, are first used to reduce the spatial resolution of the feature maps. The model then extracts more accurate semantic features of the image through the remaining four blocks. Each of these four blocks contains a number of 3 × 3 convolution units with various strides, and the number of output channels increases with network depth. The spatial resolution of each block is controlled by its first unit: in Block-3 through Block-5, the first unit halves the resolution, serving the dual purposes of downsampling and increasing dimensions. Additionally, ResNet-18 introduces residual connections between every two layers after Block-1, which enables the construction of deeper networks and addresses the degradation issue of deep networks.
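A minimal sketch of this pseudo-Siamese features extraction module is given below, assuming the standard torchvision ResNet-18 layout (64, 128, 256, and 512 channels for Block-2 through Block-5). The adaptation of the first convolution to 4-band optical and 2-band SAR inputs is our assumption; only the multi-scale feature maps needed by the later modules are returned.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ResNet18Branch(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        net = resnet18(weights=None)
        # Block-1: 7x7 convolution + 3x3 max pooling (overall stride 4)
        self.block1 = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False),
            net.bn1, net.relu, net.maxpool)
        self.block2, self.block3 = net.layer1, net.layer2
        self.block4, self.block5 = net.layer3, net.layer4

    def forward(self, x):
        x = self.block1(x)
        f2 = self.block2(x)   # 1/4 resolution, 64 channels
        f3 = self.block3(f2)  # 1/8 resolution, 128 channels
        f4 = self.block4(f3)  # 1/16 resolution, 256 channels
        f5 = self.block5(f4)  # 1/32 resolution, 512 channels
        return f2, f3, f4, f5

class PseudoSiameseExtractor(nn.Module):
    """Two structurally identical branches without weight sharing."""
    def __init__(self):
        super().__init__()
        self.optical_branch = ResNet18Branch(in_channels=4)  # R, G, B, NIR
        self.sar_branch = ResNet18Branch(in_channels=2)      # VV, VH

    def forward(self, optical, sar):
        return self.optical_branch(optical), self.sar_branch(sar)
```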

Table 2. Details of features extraction module.

3.3. Cross-modal features fusion (CMFF) module

The features of heterogeneous multi-modal data sources such as optical-SAR data each have their own distinctive characteristics (Hong et al. 2021). Even if directly fusing features through layer stacking improves the extraction result, it can cause feature redundancy, making it difficult for the model to learn the most useful features across the multi-modal data sources and to maximize the information complementarity between the two modalities, and it can even negatively affect the extraction of urban impervious surface.

According to the idea of the attention mechanism (Hu, Shen, and Sun 2018), multi-modal features can be expected to be fused more effectively through attention. Therefore, a CMFF module, whose structure is depicted in Figure 4, is designed in this study to fuse scale-matched optical-SAR image features. In this module, the optical and SAR image features extracted by the pseudo-Siamese network (assuming each has dimensions C × H × W) are first concatenated to form features with dimensions 2C × H × W. The concatenated features are then squeezed into a vector of dimensions 1 × 1 × 2C through a global average pooling layer, which can be represented by the following equation:

(1) $f_{sq}=F_{sq}(f_{2C})=\frac{1}{W\times H}\sum_{i=1}^{H}\sum_{j=1}^{W}f_{2C}(i,j)$

where $f_{2C}$ and $f_{sq}$ denote the concatenated features and the squeezed features, respectively. Then, two fully connected layers perform the excitation operation, establishing channel correlations and generating a weight for each feature channel. The first fully connected layer reduces the channel dimension by a factor of 16 and is activated by ReLU to reduce the computational complexity; the second fully connected layer restores the original channel dimension and is activated by the sigmoid function to obtain the final weight of each channel. This part can be expressed as:

(2) $f_{ex}=F_{ex}(f_{sq},W)=\delta(g(f_{sq},W))=\delta(W_{2}\,\delta(W_{1}f_{sq}))$

where $\delta(\cdot)$ represents the activation function and $W_{1}$ and $W_{2}$ are the weights of the two fully connected layers. The weighted feature maps are produced by multiplying the original input feature maps by the normalized weights via channel-wise multiplication, as shown in the following equation:

(3) $f_{weighted}=f_{ex}\otimes f_{2C}$

Figure 4. Schematic diagram of the CMFF module.


Then a convolution layer consisting of a 1 × 1 convolution, batch normalization (BN), and ReLU is used to reduce the channel dimension and generate more refined feature maps $f_{refined}$. The final optical-SAR cross-modal fusion features are obtained by multiplying the refined feature maps $f_{refined}$ by the optical and SAR features, respectively, and adding the two products. This process can be described by the following equations:

(4) $f_{SAR}^{\prime}=f_{refined}\otimes f_{SAR}$

(5) $f_{optical}^{\prime}=f_{refined}\otimes f_{optical}$

(6) $f_{fused}=f_{optical}^{\prime}\oplus f_{SAR}^{\prime}$

where $\otimes$ and $\oplus$ denote element-wise multiplication and element-wise summation, respectively, and $f_{fused}$ represents the features fused through the CMFF module. To obtain multi-scale optical-SAR fusion features, this operation is applied to the features at each scale output by the features extraction module.
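A minimal sketch of the CMFF module under Eqs. (1)-(6) is given below. The channel bookkeeping (2C channels after concatenation, a squeeze-and-excitation reduction ratio of 16, and reduction back to C channels by the 1 × 1 convolution) is our reading of the text, not a verified implementation detail.

```python
import torch
import torch.nn as nn

class CMFF(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                       # squeeze: Eq. (1)
        self.fc = nn.Sequential(                                  # excitation: Eq. (2)
            nn.Linear(2 * channels, 2 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2 * channels),
            nn.Sigmoid())
        self.refine = nn.Sequential(                              # 1x1 Conv + BN + ReLU
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True))

    def forward(self, f_optical, f_sar):
        f_cat = torch.cat([f_optical, f_sar], dim=1)              # 2C x H x W
        n, c, _, _ = f_cat.shape
        weights = self.fc(self.pool(f_cat).view(n, c)).view(n, c, 1, 1)
        f_weighted = f_cat * weights                              # Eq. (3), channel-wise
        f_refined = self.refine(f_weighted)                       # C x H x W
        return f_refined * f_optical + f_refined * f_sar          # Eqs. (4)-(6)
```

In CMFFNet one such module would be applied at every scale of the pseudo-Siamese backbone, e.g. `CMFF(64)`, `CMFF(128)`, `CMFF(256)`, and `CMFF(512)` for Block-2 through Block-5.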

3.4. Multi-scale features fusion (MSFF) module

Roads, buildings, and parking lots make up the urban impervious surface, and they vary in scale in remote sensing imagery. It is challenging to fully extract urban impervious surface of diverse scales using single-scale features. Based on this, this study designs an MSFF module to fuse features of various scales for impervious surface extraction. In this module, the resolution of the features at various scales is first unified through upsampling with different scale factors. To make the spatial resolution of the feature maps consistent with that of the Block-2 feature maps, the optical-SAR fusion features of Block-5, Block-4, and Block-3 are upsampled by factors of 8, 4, and 2, respectively. These features are then concatenated, and the fused multi-scale features are created by fusing and normalizing the concatenated feature maps with 3 × 3 convolution, BN, and ReLU operations. The fused multi-scale features have 128 channels. The entire multi-scale features fusion process can be represented by the following formulas:

(7) $F_{fused}^{i}=\mathrm{upsample}(f_{fused}^{i},\psi),\quad i=3,4,5$

(8) $F_{cat}=\mathrm{Con}(f_{fused}^{2},F_{fused}^{3},F_{fused}^{4},F_{fused}^{5})$

(9) $f_{scale\text{-}fused}=\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3\times 3}(F_{cat})))$

where $f_{fused}^{i}$ denotes the fusion features at different scales obtained through the cross-modal features fusion module, and $\psi$ denotes the scale factor of the upsampling. $\mathrm{Con}(\cdot)$ is the concatenation operation, and $\mathrm{Conv}_{3\times 3}(\cdot)$, $\mathrm{BN}(\cdot)$, and $\mathrm{ReLU}(\cdot)$ represent the 3 × 3 convolution, BN, and ReLU operations, respectively.
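A minimal sketch of the MSFF module under Eqs. (7)-(9) follows. The per-scale channel counts are assumed from the ResNet-18 backbone, and bilinear interpolation is assumed for the upsampling operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFF(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=128):
        super().__init__()
        # 3x3 Conv + BN + ReLU fusing the concatenated multi-scale features: Eq. (9)
        self.fuse = nn.Sequential(
            nn.Conv2d(sum(in_channels), out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True))

    def forward(self, f2, f3, f4, f5):
        # Eq. (7): bring all scales to the Block-2 resolution
        f3 = F.interpolate(f3, scale_factor=2, mode='bilinear', align_corners=False)
        f4 = F.interpolate(f4, scale_factor=4, mode='bilinear', align_corners=False)
        f5 = F.interpolate(f5, scale_factor=8, mode='bilinear', align_corners=False)
        f_cat = torch.cat([f2, f3, f4, f5], dim=1)   # Eq. (8)
        return self.fuse(f_cat)                       # 128-channel fused features
```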

3.5. Loss function

Given that urban impervious surface extraction is a pixel-wise task with only two segmentation categories, this study uses the commonly used binary cross entropy as the loss to optimize the model. The binary cross entropy can be calculated using the following formula:

(10) $L_{BCE}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log(p_{i})+(1-y_{i})\log(1-p_{i})\right]$

where $N$ represents the number of samples, and $y_{i}$ and $p_{i}$ correspond to the label of the i-th sample and the probability that the i-th sample is predicted to be the positive class, respectively. In this study, impervious surface is regarded as the positive class and pervious surface as the negative class.
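In PyTorch, Eq. (10) corresponds to the standard binary cross-entropy loss; a short sketch with illustrative tensor shapes (batch size 32, 128 × 128 patches) is shown below. Using BCEWithLogitsLoss, which folds the sigmoid into the loss for numerical stability, is an implementation choice on our part rather than a detail stated in the text.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = torch.randn(32, 1, 128, 128)                     # raw single-channel network output
labels = torch.randint(0, 2, (32, 1, 128, 128)).float()   # 1 = impervious, 0 = pervious
loss = criterion(logits, labels)                          # mean over all N pixels, as in Eq. (10)
```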

3.6. Training details

All experiments in this study are conducted on a 64-bit Ubuntu 16.04 operating system with an NVIDIA GeForce RTX 2080 Ti graphics card (11 GB of memory). The code is implemented in the PyTorch deep learning framework.

To ensure fair comparison, we use the same parameters and strategies for training in each set of experiments: a batch size of 32, an initial learning rate of 0.0001, and 400 epochs. This study employs the poly learning rate policy to dynamically adjust the learning rate during model training in order to improve training outcomes. Additionally, the model optimizes the objective loss function with the Adam optimizer and updates the model parameters through back propagation. Furthermore, data augmentation is applied to the training set to mitigate overfitting and improve the model’s robustness; no augmentation is applied to the test and validation sets. The data augmentation used in this study includes flipping, random rotation, and Gaussian blur.
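The following sketch illustrates this optimization setup: Adam with an initial learning rate of 1e-4, a batch size of 32, 400 epochs, and a poly schedule of the form lr = lr0 × (1 − iter/max_iter)^power applied per iteration. The power of 0.9, the iteration count per epoch, and the toy stand-in model are illustrative assumptions so that the snippet runs on its own.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(6, 1, kernel_size=3, padding=1)    # toy stand-in for CMFFNet
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

epochs, iters_per_epoch, power = 400, 140, 0.9        # ~4482 training patches / batch size 32
max_iter = epochs * iters_per_epoch
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1.0 - it / max_iter) ** power)   # poly learning rate decay

for it in range(3):                                   # a few demo iterations
    x = torch.randn(32, 6, 128, 128)                  # stacked 4-band optical + 2-band SAR patch
    y = torch.randint(0, 2, (32, 1, 128, 128)).float()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()                                   # back propagation
    optimizer.step()                                  # Adam parameter update
    scheduler.step()                                  # decay the learning rate each iteration
```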

4. Results

This section evaluates the performance of the proposed CMFFNet through experiments on the impervious surface dataset created in this study. We first provide detailed information about the experimental setup, including the comparison methods and the accuracy evaluation indicators for impervious surface extraction, and then compare the proposed CMFFNet with other mainstream deep learning methods.

4.1. Comparative methods and evaluation metrics

To evaluate the effectiveness of the proposed CMFFNet in impervious surface extraction, we perform several experiments on our dataset using a number of mainstream deep learning segmentation models as comparison methods. These comparison methods, which have all been retrained on our dataset, consist of Deeplabv3+ (Chen et al. 2018), HRNet (Sun et al. 2019), PSPNet (Zhao et al. 2017), BiSeNetV2 (Yu et al. 2021), and FCN (Long, Shelhamer, and Darrell 2015).

Six impervious surface evaluation metrics are also employed to quantify the effectiveness of our proposed method. These metrics specifically include overall accuracy (OA), Kappa coefficient, mean intersection over union (mIoU), category IoU, producer’s accuracy (PA), and user’s accuracy (UA).
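For reference, a minimal sketch of how these metrics can be computed from a pixel-wise confusion matrix is given below; the function name and the convention that class 1 is impervious surface (IS) and class 0 pervious surface (NIS) are our own.

```python
import numpy as np

def metrics(y_true, y_pred, num_classes=2):
    """Compute OA, Kappa, per-class IoU, mIoU, PA, and UA from flattened label maps."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true.ravel(), y_pred.ravel()):
        cm[t, p] += 1                                  # rows: reference, columns: prediction
    total = cm.sum()
    oa = np.trace(cm) / total                          # overall accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)                       # agreement beyond chance
    iou = np.diag(cm) / (cm.sum(axis=0) + cm.sum(axis=1) - np.diag(cm))
    pa = np.diag(cm) / cm.sum(axis=1)                  # producer's accuracy (recall)
    ua = np.diag(cm) / cm.sum(axis=0)                  # user's accuracy (precision)
    return {'OA': oa, 'Kappa': kappa, 'IoU': iou, 'mIoU': iou.mean(), 'PA': pa, 'UA': ua}
```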

4.2. Results of impervious surface extraction

In this study, we propose an impervious surface extraction model that fuses the cross-modal features of optical-SAR images. To better illustrate the necessity and effectiveness of fusing optical-SAR image features, comparative experiments are conducted on the proposed model and the selected comparison methods with different modal data as inputs. The different modal data include single SAR data, single optical data, and multi-modal data formed by layer stacking the optical-SAR images.

Table 3 shows the extraction accuracy of impervious surface using SAR and optical images alone as the model inputs. As can be seen from Table 3, when the optical image alone is used as input, the models achieve considerable accuracy in extracting impervious surface, with OA between 93% and 96%, Kappa between 0.85 and 0.90, and mIoU between 0.86 and 0.91. Among them, Deeplabv3+ has the highest accuracy, with OA, Kappa, and mIoU reaching 95.78%, 0.9040, and 0.9089, respectively. When the SAR image rather than the optical image is used as input, the OA, Kappa, and mIoU of the models drop to between 90% and 92%, 0.77 and 0.82, and 0.79 and 0.84, respectively. Additionally, there is no discernible difference in impervious surface extraction accuracy between the various models when the SAR image is used as input. This might be because SAR images have a different imaging mechanism than optical and natural images and are heavily contaminated by speckle noise, making it challenging to directly apply deep learning models designed for natural image characteristics to SAR images and achieve satisfactory segmentation results. According to Table 3, the accuracy of the proposed CMFFNet using optical-SAR images as input is significantly higher than that of the other models using single optical or SAR data as input. Additionally, we observe that extracting pervious surface is easier than extracting impervious surface.

Table 3. Results of impervious surface extraction using a single-modal image as input. IS and NIS in this table indicate impervious and pervious surfaces, respectively.

Figures 5 and 6 provide the extraction results of impervious surface for each model using the optical image alone and the SAR image alone as model input, respectively. As can be seen from Figure 5, although FCN, PSPNet, and BiSeNetV2 can extract most of the impervious surface correctly, their extraction results are patchy and not refined enough. Compared to these three methods, the impervious surface results extracted by HRNet and Deeplabv3+ are more refined, for example at the edges of ground objects, and are very close to the ground truth. Furthermore, it is difficult to conclude that the proposed CMFFNet significantly outperforms HRNet and Deeplabv3+ based only on a visual interpretation of Figure 5; it is therefore necessary to also consider the quantitative accuracy indicators given in Table 3. The results displayed in Figure 6 indicate that all models, with the exception of the proposed CMFFNet, produce very similar extraction results and are unable to accurately extract impervious surface around the margins of ground objects and roads. This is mostly because speckle noise in SAR images blurs the edges of ground objects, making it challenging for the models to identify impervious surface at these edges.

Figure 5. Impervious surface extraction results using optical image alone as input.


Figure 6. Impervious surface extraction results using SAR image alone as input.


The above qualitative and quantitative experimental results show that single-modal data offer only a limited feature representation, and that even more advanced and complex models struggle to further improve impervious surface extraction accuracy, so it is reasonable to carry out multi-modal fusion of optical-SAR images.

Next, we further combine the optical-SAR images into multi-modal data through layer stacking and input them into the FCN, HRNet, Deeplabv3+, PSPNet, and BiSeNetV2 models to extract impervious surface, and compare the results with the proposed CMFFNet to demonstrate its superiority. Table 4 shows the accuracy of the impervious surface extraction results for the various models based on multi-modal data.

Table 4. Results of impervious surface extraction using multi-modal images as input. IS and NIS in this table indicate impervious and pervious surfaces, respectively.

As reflected in Table 4, the proposed CMFFNet has the highest impervious surface extraction accuracy and is significantly superior to the other five comparison models. The OA, Kappa, and mIoU of the proposed CMFFNet are 0.53%, 0.0119, and 0.0108 higher than those of Deeplabv3+, the most accurate comparison model, and 3.02%, 0.0689, and 0.0608 higher than those of BiSeNetV2, the least accurate model. Additionally, the extraction accuracy of all models shows that pervious surfaces are easier to identify than impervious surfaces. Figure 7 shows the impervious surface results extracted by each model based on multi-modal data input. The impervious surface extraction results of the proposed CMFFNet are significantly better than those of the other models, which is consistent with the quantitative results in Table 4. Compared to the other models, CMFFNet extracts more complete impervious surface and can also extract more complex impervious surface such as the edges of ground objects and roads. By contrast, the other comparison models, especially FCN, PSPNet, and BiSeNetV2, show significant fractures in their extracted impervious surface results. These results demonstrate that the CMFFNet proposed in this study outperforms other mainstream deep learning models in extracting impervious surface.

Figure 7. Results of impervious surface extraction using multi-modal images as input.


4.3. Results of ablation experiments

Within the proposed CMFFNet, we construct the CMFF and MSFF modules, taking into account that optical-SAR images belong to heterogeneous data sources and that urban impervious surface has multi-scale characteristics. Two ablation experiments are carried out in this study to demonstrate how well these designed modules work: CMFFNet without CMFF and CMFFNet without MSFF. In the CMFFNet without CMFF experiment, the CMFF module is removed from CMFFNet and replaced by a concatenation operation to fuse the optical-SAR features. In the CMFFNet without MSFF experiment, the MSFF module is removed from CMFFNet, and only the Block-5 features of the optical-SAR images are fused through the CMFF module to predict impervious surface.

The accuracy evaluation of the impervious surface extraction results in the ablation studies is displayed in Table 5. As reflected in this table, the accuracy of CMFFNet is higher than that of the CMFFNet without CMFF experiment, which indicates that, for heterogeneous optical-SAR data sources, directly fusing optical-SAR image features through layer stacking is inappropriate and cannot fully exploit the complementary information between optical-SAR images. It also illustrates the effectiveness and rationality of the designed cross-modal features fusion module, which might serve as inspiration for further studies centered on the fusion of optical-SAR images. Compared with CMFFNet, the extraction accuracy of the CMFFNet without MSFF experiment is very poor, with OA, Kappa, and mIoU 9.75%, 0.2271, and 0.1853 lower than those of CMFFNet, respectively. This indicates that considering only one scale of features is not feasible when extracting urban impervious surface, and it confirms the efficacy of the MSFF module designed in this study.

Table 5. Accuracy evaluation of impervious surface extraction results in ablation studies. IS and NIS in this table indicate impervious and pervious surfaces, respectively.

Figure 8 presents the impervious surface extraction results of the ablation experiments, which align with the trends shown in Table 5. Compared to the results of CMFFNet, the results of the CMFFNet without CMFF experiment show more missed impervious surface at roads and bridges (columns 6 and 7 in Figure 8). The results of the CMFFNet without MSFF experiment are the worst: the extracted impervious surface is very rough and distributed in patches, and impervious surface is significantly overestimated. This is mostly because only the features of Block-5, the final stage of the features extraction module, are taken into account for impervious surface prediction in this ablation experiment, and Block-5’s feature maps have a spatial resolution that is only 1/32 of the original input images. The model therefore performs 32× upsampling on the impervious surface prediction from Block-5 to generate a result consistent with the spatial resolution of the original image. Thus, the effectiveness and rationale of the CMFF and MSFF modules designed in CMFFNet have been confirmed by the above ablation results.

Figure 8. Impervious surface extraction results of ablation studies.


5. Discussion

Compared to single-modal data, how much does the fusion of optical-SAR images specifically improve the extraction of urban impervious surface? This section mainly discusses this issue. Table 6 shows the extraction accuracy of impervious surface from CMFFNet using optical images alone, SAR images alone, and optical-SAR image fusion as inputs. From this table, it can be seen that CMFFNet using optical-SAR multi-modal images as input has the highest extraction accuracy, with OA, Kappa, and mIoU higher than those of CMFFNet using SAR images alone as input by 4.72%, 0.1083, and 0.0937, respectively, and higher than those of CMFFNet using optical images alone as input by 0.38%, 0.0085, and 0.0077, respectively. The impervious surface extracted by CMFFNet using the various modal images as inputs is displayed in Figure 9. When extracting impervious surface, CMFFNet using optical or SAR single-modal data as input suffers from cracks and discontinuities, and its extraction results are worse than when multi-modal data are used as input. These quantitative and qualitative results show that the CMFFNet proposed in this study can effectively mine and learn the multi-modal information between optical-SAR images and overcome the bottleneck of the inadequate representation of single-modal data.

Figure 9. Impervious surface extraction results of CMFFNet with different modals as input.


Table 6. Comparison of impervious surface results extracted from CMFFNet with different modals as input. IS and NIS in this table indicate impervious and pervious surfaces, respectively.

Furthermore, to better understand the improvement that optical-SAR image fusion brings to urban impervious surface extraction compared with single-modal data, we quantitatively evaluate, for each model, the improvement obtained when using optical-SAR image fusion as input and when using the optical image alone as input, relative to using the SAR image alone as input, as shown in Figure 10. As can be seen from this figure, compared to using the SAR image alone, the accuracy of the impervious surface results extracted by all models using optical-SAR fusion as input and using the optical image alone as input improves significantly, especially with optical-SAR fusion, although the degree of improvement differs. Among them, the improvement of the proposed CMFFNet is the most obvious, while the improvements of HRNet and Deeplabv3+ are also considerable but lower than that of the proposed CMFFNet. In conclusion, fusing optical-SAR images is an effective way to improve the accuracy of extracting urban impervious surface.

Figure 10. Improvement of impervious surface extraction of each model using optical-SAR images fusion as input and optical image alone as input relative to SAR image alone as input. (a)-(c) are the improvement of OA, Kappa and mIoU, respectively.


6. Conclusions

Monitoring the spatiotemporal distribution of urban impervious surface is crucial for advancing scientific urban planning, as it serves as a key indicator of the urbanization process. Optical-SAR images are two important data sources for extracting urban impervious surface. The extraction of urban impervious surface using optical or SAR images alone encounters a bottleneck in accuracy enhancement due to the limitation of single-modal feature representation; fusing the features of the two data sources helps overcome this bottleneck. However, most previous research on fusing optical-SAR images to extract impervious surface has not taken the modal differences between optical-SAR images into account and has instead performed feature fusion directly through layer stacking, which prevents it from fully realizing the complementarity between the two. To address this issue, this study proposes a CMFFNet for optical-SAR images to extract urban impervious surface. In the proposed CMFFNet, a CMFF module is designed to fully utilize the complementary information of optical-SAR images. Furthermore, we propose an MSFF module to fuse the multi-scale features of optical-SAR images, taking into account the multi-scale characteristics of urban impervious surface. According to the experimental validation, the main conclusions of this study are as follows. Firstly, the proposed CMFFNet outperforms other mainstream models in impervious surface extraction, with an OA of 96.45%. Secondly, the optical-SAR CMFF module proposed in this study is more effective than the commonly used feature concatenation fusion strategy. Thirdly, considering the multi-scale characteristics of impervious surfaces and proposing an MSFF module is reasonable and effective. Lastly, fusing optical-SAR images can significantly improve the accuracy of extracting urban impervious surface. Based on these findings, we will develop more sophisticated multi-modal features fusion models for optical-SAR images in the future to further improve the accuracy of urban impervious surface extraction.

Data availability statement

The data that support the findings of this study are available from the corresponding author, upon reasonable request.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by National Natural Science Foundation of China [grant number 42090012]; Key R & D project of Sichuan Science and Technology Plan [grant number 2022YFN0031]; Sichuan Science and Technology Program [grant number 2023YFN0022]; Zhizhuo Research Fund on Spatial–Temporal Artificial Intelligence [grant number ZZJJ202202]; the Special Fund of Hubei Luojia Laboratory [grant number 220100009]; Zhuhai industry university research cooperation project of China [grant number ZH22017001210098PWC].

References

  • Chen, Liang-Chieh, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. “Encoder-decoder with Atrous Separable Convolution for Semantic Image Segmentation.” Proceedings of the European Conference on Computer Vision (ECCV).
  • Deng, Chengbin, and Changshan Wu. 2012. “BCI: A Biophysical Composition Index for Remote Sensing of Urban Environments.” Remote Sensing of Environment 127: 247–259. https://doi.org/10.1016/j.rse.2012.09.009.
  • Fan, Fenglei, and Yingbin Deng. 2014. “Enhancing Endmember Selection in Multiple Endmember Spectral Mixture Analysis (MESMA) for Urban Impervious Surface Area Mapping Using Spectral Angle and Spectral Distance Parameters.” International Journal of Applied Earth Observation and Geoinformation 33: 290–301. https://doi.org/10.1016/j.jag.2014.06.011.
  • Fu, Peng, and Qihao Weng. 2016. “A Time Series Analysis of Urbanization Induced Land Use and Land Cover Change and Its Impact on Land Surface Temperature with Landsat Imagery.” Remote Sensing of Environment 175: 205–214. https://doi.org/10.1016/j.rse.2015.12.040.
  • Gong, Peng, Xuecao Li, Jie Wang, Yuqi Bai, Bin Chen, Tengyun Hu, Xiaoping Liu, Bing Xu, Jun Yang, and Wei Zhang. 2020. “Annual Maps of Global Artificial Impervious Area (GAIA) Between 1985 and 2018.” Remote Sensing of Environment 236: 111510. https://doi.org/10.1016/j.rse.2019.111510.
  • Guo, Huadong, Huaining Yang, Zhongchang Sun, Xinwu Li, and Cuizhen Wang. 2014. “Synergistic Use of Optical and PolSAR Imagery for Urban Impervious Surface Estimation.” Photogrammetric Engineering & Remote Sensing 80 (1). https://doi.org/10.14358/PERS.80.1.91.
  • He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2016.90.
  • Hong, Danfeng, Jingliang Hu, Jing Yao, Jocelyn Chanussot, and Xiao Xiang Zhu. 2021. “Multimodal Remote Sensing Benchmark Datasets for Land Cover Classification with a Shared and Specific Feature Learning Model.” ISPRS Journal of Photogrammetry and Remote Sensing 178: 68–80. https://doi.org/10.1016/j.isprsjprs.2021.05.011.
  • Hong, Danfeng, Jing Yao, Deyu Meng, Zongben Xu, and Jocelyn Chanussot. 2021. “Multimodal GANs: Toward Crossmodal Hyperspectral–Multispectral Image Segmentation.” IEEE Transactions on Geoscience and Remote Sensing 59 (6): 5103–5113. https://doi.org/10.1109/TGRS.2020.3020823.
  • Hoole, Charlotte, Stephen Hincks, and Alasdair Rae. 2019. “The Contours of a new Urban World? Megacity Population Growth and Density Since 1975.” Town Planning Review 90 (6). https://doi.org/10.3828/tpr.2019.41.
  • Hu, Jie, Li Shen, and Gang Sun. 2018. “Squeeze-and-excitation Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Hu, Xuefei, and Qihao Weng. 2011. “Impervious Surface Area Extraction from IKONOS Imagery Using an Object-Based Fuzzy Method.” Geocarto International 26 (1): 3–20. https://doi.org/10.1080/10106049.2010.535616.
  • Huang, Xin, Yihong Song, Jie Yang, Wenrui Wang, Huiqun Ren, Mengjie Dong, Yujin Feng, Haidan Yin, and Jiayi Li. 2022. “Toward Accurate Mapping of 30-m Time-Series Global Impervious Surface Area (GISA).” International Journal of Applied Earth Observation and Geoinformation 109: 102787. https://doi.org/10.1016/j.jag.2022.102787.
  • Kuang, Wenhui. 2020. “70 Years of Urban Expansion Across China: Trajectory, Pattern, and National Policies.” Science Bulletin 65 (23): 1970–1974. https://doi.org/10.1016/j.scib.2020.07.005.
  • Li, Xue, Guo Zhang, Hao Cui, Shasha Hou, Shunyao Wang, Xin Li, Yujia Chen, Zhijiang Li, and Li Zhang. 2022. “MCANet: A Joint Semantic Segmentation Framework of Optical and SAR Images for Land use Classification.” International Journal of Applied Earth Observation and Geoinformation 106: 102638. https://doi.org/10.1016/j.jag.2021.102638.
  • Lin, Yinyi, Hongsheng Zhang, Hui Lin, Paolo Ettore Gamba, and Xiaoping Liu. 2020. “Incorporating Synthetic Aperture Radar and Optical Images to Investigate the Annual Dynamics of Anthropogenic Impervious Surface at Large Scale.” Remote Sensing of Environment 242: 111757. https://doi.org/10.1016/j.rse.2020.111757.
  • Long, Jonathan, Evan Shelhamer, and Trevor Darrell. 2015. “Fully Convolutional Networks for Semantic Segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2015.7298965.
  • Lu, Dengsheng, Guiying Li, Wenhui Kuang, and Emilio Moran. 2014. “Methods to Extract Impervious Surface Areas from Satellite Images.” International Journal of Digital Earth 7 (2): 93–112. https://doi.org/10.1080/17538947.2013.866173.
  • Lu, Dengsheng, and Qihao Weng. 2009. “Extraction of Urban Impervious Surfaces from an IKONOS Image.” International Journal of Remote Sensing 30 (5): 1297–1311. https://doi.org/10.1080/01431160802508985.
  • Shao, Zhenfeng, Huyan Fu, Deren Li, Orhan Altan, and Tao Cheng. 2019. “Remote Sensing Monitoring of Multi-Scale Watersheds Impermeability for Urban Hydrological Evaluation.” Remote Sensing of Environment 232: 111338. https://doi.org/10.1016/j.rse.2019.111338.
  • Shao, Zhenfeng, Yingjie Tian, and Xiaole Shen. 2014. “BASI: A new Index to Extract Built-up Areas from High-Resolution Remote Sensing Images by Visual Attention Model.” Remote Sensing Letters 5 (4): 305–314. https://doi.org/10.1080/2150704X.2014.889861.
  • Simonyan, Karen, and Andrew Zisserman. 2014. “Very Deep Convolutional Networks for Large-scale Image Recognition.” arXiv preprint arXiv:1409.1556.
  • Sun, Genyun, Ji Cheng, Aizhu Zhang, Xiuping Jia, Yanjuan Yao, and Zhijun Jiao. 2022a. “Hierarchical Fusion of Optical and Dual-Polarized SAR on Impervious Surface Mapping at City Scale.” ISPRS Journal of Photogrammetry and Remote Sensing 184: 264–278. https://doi.org/10.1016/j.isprsjprs.2021.12.008.
  • Sun, Zhongchang, Wenjie Du, Huiping Jiang, Qihao Weng, Huadong Guo, Youmei Han, Qiang Xing, and Yuanxu Ma. 2022b. “Global 10-m Impervious Surface Area Mapping: A Big Earth Data Based Extraction and Updating Approach.” International Journal of Applied Earth Observation and Geoinformation 109: 102800. https://doi.org/10.1016/j.jag.2022.102800.
  • Sun, Ke, Bin Xiao, Dong Liu, and Jingdong Wang. 2019. “Deep High-Resolution Representation Learning for Human Pose Estimation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2019.00584.
  • Wang, Zhaoqi, Chencheng Gang, Xueling Li, Yizhao Chen, and Jianlong Li. 2015. “Application of a Normalized Difference Impervious Index (NDII) to Extract Urban Impervious Surface Features Based on Landsat TM Images.” International Journal of Remote Sensing 36 (4): 1055–1069. https://doi.org/10.1080/01431161.2015.1007250.
  • Wang, Yuliang, and Mingshi Li. 2019. “Urban Impervious Surface Detection from Remote Sensing Images: A Review of the Methods and Challenges.” IEEE Geoscience and Remote Sensing Magazine 7 (3): 64–93. https://doi.org/10.1109/MGRS.2019.2927260.
  • Weng, Qihao. 2012. “Remote Sensing of Impervious Surfaces in the Urban Areas: Requirements, Methods, and Trends.” Remote Sensing of Environment 117: 34–49. https://doi.org/10.1016/j.rse.2011.02.030.
  • Wu, Changshan. 2004. “Normalized Spectral Mixture Analysis for Monitoring Urban Composition Using ETM+ Imagery.” Remote Sensing of Environment 93 (4): 480–492. https://doi.org/10.1016/j.rse.2004.08.003.
  • Wu, Wenfu, Zhenfeng Shao, Xiao Huang, Jiahua Teng, Songjing Guo, and Deren Li. 2022. “Quantifying the Sensitivity of SAR and Optical Images Three-Level Fusions in Land Cover Classification to Registration Errors.” International Journal of Applied Earth Observation and Geoinformation 112: 102868. https://doi.org/10.1016/j.jag.2022.102868.
  • Xu, Hanqiu. 2010. “Analysis of Impervious Surface and its Impact on Urban Heat Environment Using the Normalized Difference Impervious Surface Index (NDISI).” Photogrammetric Engineering & Remote Sensing 76 (5): 557–565. https://doi.org/10.14358/PERS.76.5.557.
  • Yao, Jing, Bing Zhang, Chenyu Li, Danfeng Hong, and Jocelyn Chanussot. 2023. “Extended Vision Transformer (ExViT) for Land Use and Land Cover Classification: A Multimodal Deep Learning Framework.” IEEE Transactions on Geoscience and Remote Sensing 61. https://doi.org/10.1109/TGRS.2023.3284671.
  • Yu, Changqian, Changxin Gao, Jingbo Wang, Gang Yu, Chunhua Shen, and Nong Sang. 2021. “Bisenet v2: Bilateral Network with Guided Aggregation for Real-Time Semantic Segmentation.” International Journal of Computer Vision 129 (11): 3051–3068. https://doi.org/10.1007/s11263-021-01515-2.
  • Zhang, Hongsheng, Hui Lin, and Yu Li. 2015. “Impacts of Feature Normalization on Optical and SAR Data Fusion for Land use/Land Cover Classification.” IEEE Geoscience and Remote Sensing Letters 12 (5): 1061–1065. https://doi.org/10.1109/LGRS.2014.2377722.
  • Zhang, Hongsheng, Luoma Wan, Ting Wang, Yinyi Lin, Hui Lin, and Zezhong Zheng. 2019. “Impervious Surface Estimation from Optical and Polarimetric SAR Data Using Small-Patched Deep Convolutional Networks: A Comparative Study.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12 (7): 2374–2387. https://doi.org/10.1109/JSTARS.2019.2915277.
  • Zhang, Hongsheng, and Ru Xu. 2018. “Exploring the Optimal Integration Levels Between SAR and Optical Data for Better Urban Land Cover Mapping in the Pearl River Delta.” International Journal of Applied Earth Observation and Geoinformation 64: 87–95. https://doi.org/10.1016/j.jag.2017.08.013.
  • Zhang, Haotian, Jing Yao, Li Ni, Lianru Gao, and Min Huang. 2023. “Multimodal Attention-Aware Convolutional Neural Networks for Classification of Hyperspectral and LiDAR Data.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16: 3635–3644. https://doi.org/10.1109/JSTARS.2022.3187730.
  • Zhang, Hongsheng, Yuanzhi Zhang, and Hui Lin. 2012. “A Comparison Study of Impervious Surfaces Estimation Using Optical and SAR Remote Sensing Images.” International Journal of Applied Earth Observation and Geoinformation 18: 148–156. https://doi.org/10.1016/j.jag.2011.12.015.
  • Zhao, Hengshuang, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2017. “Pyramid Scene Parsing Network.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.