Research Article

Water extraction from optical high-resolution remote sensing imagery: a multi-scale feature extraction network with contrastive learning

Article: 2166396 | Received 25 Aug 2022, Accepted 04 Jan 2023, Published online: 10 Jan 2023

ABSTRACT

Accurate knowledge of the spatiotemporal distribution of water bodies is of great importance in the fields of ecology and environment. Recently, convolutional neural networks (CNNs) have been widely used for this purpose due to their powerful feature extraction ability. However, CNN-based methods have two limitations in extracting water bodies. First, the large variations in both the spatial and spectral characteristics of water bodies require that CNN-based methods be able to extract multi-scale features and use multi-layer features. Second, collecting enough samples for the training phase of a CNN is difficult. Therefore, this paper proposes a multi-scale feature extraction network (MSFENet) for water extraction, whose advantages stem from two distinct features: (1) a multi-scale feature extractor (MSFE) is designed to extract multi-layer, multi-scale features of water bodies; (2) contrastive learning (CL) is adopted to reduce the sample size requirement. Experimental results show that the MSFE can effectively improve small water body extraction, and that CL can significantly improve extraction accuracy when the training sample size is small. Compared with other methods, MSFENet achieves the highest F1-score and kappa coefficient on two datasets. Furthermore, spectral variability analysis shows that MSFENet is more robust than other neural networks in spectrum variation scenarios.

1 Introduction

Water is the source of life and essential for land ecological systems (Du, Ottens, and Sliuzas Citation2010), with a significant impact on public health, economic development and living environment (J. J. Li et al. Citation2021). Therefore, it is of great importance to obtain the spatiotemporal distribution of surface water bodies in a timely and accurate manner. With the rapid development of earth observation technologies in recent decades, optical remote sensing images with high spatial, temporal, and spectral resolution have become more accessible (Wang et al. Citation2022). Hence, using optical high-resolution images for surface water body extraction is a promising approach. Although optical high-resolution remote sensing images can provide much useful information for surface water body extraction (Chen et al. Citation2020), including spectral features (Zhou et al. Citation2021), texture features, etc., some challenges remain. First, water bodies closely resemble low-albedo objects such as shadows in image features, so they are easily confused with these dark objects. Second, the spectral signals of water bodies vary greatly in remote sensing images due to different solar altitude angles and the interference of atmospheric conditions and topography. Third, differences in both the sediment content of water bodies and the density of aquatic plants cause spectral variations of water bodies in remote sensing images. Finally, small target detection in remote sensing images is always a difficult problem because small objects are easily affected by their surroundings (Zhou et al. Citation2022). In short, the extraction of small water bodies remains challenging.

The methods for water extraction from optical high-resolution remote sensing images mainly include water index methods (WIMs) and image classification methods (ICMs). Generally, WIMs extract water in three steps: (1) selecting the bands closely related to identifying water bodies in terms of spectral characteristics; (2) constructing water index models by combining water-correlated spectral bands; (3) determining a threshold to separate water from non-water (Su et al. Citation2021). To date, many water indexes have been proposed. McFeeters (Citation1996) proposed the normalized difference water index (NDWI) with the green and NIR bands. However, NDWI is greatly affected by building shadows, which makes water extraction difficult in built-up areas. To overcome this shortcoming, Xu (Citation2005) developed the modified normalized difference water index (MNDWI) by replacing the NIR band in NDWI with the SWIR band. Yan, Zhang, and Zhang (Citation2007) proposed the enhanced water index (EWI) by combining NDWI and MNDWI. Feyisa et al. (Citation2014) developed the automated water extraction index (AWEI) using multiple bands (1, 2, 4, 5 and 7) of Landsat 5 TM to reduce noise effects in built-up and mountainous areas. Additionally, there are other, less used water indexes, such as the revised normalized difference water index (RNDWI) (Cao et al. Citation2008), the new water index (NWI) (Ding Citation2009), the Gaussian normalized difference water index (GNDWI) (Shen et al. Citation2013), the false normalized difference water index (FNDWI) (Zhou et al. Citation2014), and the shadow water index (SWI) (Chen et al. Citation2015). Although WIMs can extract water bodies simply and quickly, their results are often unsatisfactory because of the spectral similarity between water bodies and low-albedo objects and the spectral variability within water bodies. Moreover, the thresholds of WIMs must be determined appropriately to produce accurate water maps, which is not an easy task. Finally, since high spatial resolution imagery usually has only four spectral bands (i.e. blue, green, red and NIR), most WIMs are not applicable.

Various ICMs have been proposed to overcome the shortcomings of WIMs and to utilize the spatial information of optical high-resolution remote sensing images (Nath and Deb Citation2010). ICMs perform water extraction by combining spectral, shape and texture features and using various machine learning classifiers. Commonly used classifiers in ICMs include decision trees (Fu, Wang, and Li Citation2008), random forest (RF) (Cui, Wang, and Huang Citation2022), support vector machines (SVM) (Nandi, Srivastava, and Shah Citation2017), expert systems (Pekel et al. Citation2016), etc. Although ICMs can achieve better results than WIMs by using various features, these features must be constructed manually for a specific water extraction task. Additionally, the scope of application of these manually constructed features is limited, making it difficult to extract water bodies in different regions.

In recent years, deep learning, especially convolutional neural networks (CNNs), has provided an effective approach for automatically learning features at multiple levels (Chen et al. Citation2016) and has been widely used in scene classification (Liu, Zhong, and Qin Citation2018; De Lima and Marfurt Citation2020), object detection (Ren et al. Citation2017; Chen, Zhang, and Ouyang Citation2018), and semantic segmentation (Shelhamer, Long, and Darrell Citation2017; W. Z. Zhao et al. Citation2017). Since water extraction is a specific semantic segmentation task, many CNN models have been designed for it. Chen et al. (Citation2018) proposed a self-adaptive pooling layer for water extraction to reduce the loss of features in the pooling process. Chen et al. (Citation2020) performed spatial-spectral convolution to extract features from both the spatial and spectral dimensions by factorizing 3D convolutions into 2D spatial convolutions and 1D spectral convolutions. M. Y. Li et al. (Citation2021) applied the DenseBlocks of DenseNet (Huang et al. Citation2017) to construct a dense-local-feature-compression (DLFC) network in which each layer receives all of its previous feature maps; this network can automatically extract water bodies from different images of one sensor as well as from different sensors. Wang et al. (Citation2022) proposed SADA-Net for water extraction, which utilizes both an atrous spatial pyramid pooling (ASPP) module and a dual attention (DA) module. Lu et al. (Citation2022) developed a weakly supervised deep learning model named the neighbor feature aggregation network (NFANet), which improves label quality by recursive training.

Compared with ICMs, deep learning methods can automatically extract features, which is more beneficial to water extraction. However, the large number of samples required to train deep neural networks is often hard to obtain. To reduce the need for large sample sets, contrastive learning (CL) is introduced into water extraction in this paper. CL is an unsupervised representation learning method (He et al. Citation2020). It aims to learn a representation space by contrasting semantically positive and negative sample pairs, such that the features of positive sample pairs are similar while the features of negative sample pairs are different. Several recent studies have used approaches related to contrastive loss for unsupervised visual representation learning with promising results (Wu et al. Citation2018; Bachman, Hjelm, and Buchwalter Citation2019). However, the use of whole images as positive and negative samples in these works is not applicable to pixel-level tasks such as water extraction. Recently, several works (Chaitanya et al. Citation2020; Xie et al. Citation2021; Zhao et al. Citation2021; Bai et al. Citation2022) have shown that CL can provide proper representations for semantic segmentation tasks requiring high-dimensional features at the pixel level. Therefore, pixel-wise CL is preferred here to pretrain deep learning models and thus reduce the required training sample size.

The variation of water body size is another challenging problem. Most current methods cannot meet the requirement of extracting water bodies of different sizes. In addition, the sizes of feature maps decrease as features are extracted layer by layer, so small water bodies with inconspicuous features tend to be ignored, causing biased results. Therefore, multi-layer and multi-scale features are required to solve these problems. Multi-layer features refer to features extracted from different layers of a CNN. Specifically, low-layer features mainly express the detailed information of targets, while deep-layer features mainly express their overall information, such as semantic information. Multi-scale features refer to features extracted at several different scales. Small-scale features with small receptive fields are suitable for detecting small targets, while large-scale features with large receptive fields are suitable for extracting large targets. Several modules for extracting multi-scale features already exist in the field of semantic segmentation, such as spatial pyramid pooling (SPP) (He et al. Citation2015), the pyramid pooling module (PPM) in PSPNet (H. S. Zhao et al. Citation2017), and atrous spatial pyramid pooling (ASPP) in DeepLabv2 (Chen et al. Citation2018). However, these modules extract multi-scale features without multi-layer settings and mainly perform pooling operations on the feature maps, leading to a loss of information. Therefore, a multi-scale feature extraction network (MSFENet) with a multi-scale feature extractor (MSFE) is developed in this work to extract multi-layer and multi-scale features for water extraction.

In summary, the main contributions of this work can be summarized as follows:

  • A novel multi-scale feature extractor. A novel module named MSFE is presented for extracting multi-layer, multi-scale features, and MSFENet is constructed based on the MSFE for water extraction. Compared with previous methods, MSFENet achieves the best performance and is the most robust under different scenarios, especially for small water body extraction.

  • A new training strategy with contrastive learning. CL is adopted in MSFENet to reduce the demand for large sample sets in the training process. A contrastive loss is used to increase the feature similarity within the same category of targets and the feature difference between different categories, thus reducing the reliance of MSFENet on a large number of samples.

2 Methodology

2.1 MSFENet architecture

Like most semantic segmentation networks, MSFENet adopts an encoder-decoder architecture consisting of an encoder, a multi-scale feature extractor (MSFE), and a decoder (Figure 1). The encoder performs hierarchical extraction of abstract features. ResNet34 is used as the encoder in MSFENet, including an initial convolution layer with a stride of two and a kernel size of seven, a batch normalization layer, a rectified linear unit (ReLU), a max pooling layer, and four residual blocks (RB). Each RB is composed of several basic blocks (BB), and each BB includes two convolution layers, two batch normalization layers and a ReLU. The MSFE extracts multi-layer, multi-scale features to adapt to changes in water body sizes; it includes four dense atrous convolution modules (DACM) and four skip connections (SC), and each DACM extracts multi-scale features from one SC. The decoder consists of five decoder blocks (DB), which restore the feature maps from multiple layers to the size of the input image and produce the water extraction results. Each DB includes two convolution layers, three batch normalization layers, three ReLUs, and a transpose convolution layer.
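To make the data flow concrete, the following is a minimal PyTorch sketch of this encoder-decoder layout, assuming torchvision's ResNet34 as the encoder. The channel widths, the fusion by concatenation, and the names `DecoderBlock` and `MSFENetSketch` are our illustrative assumptions rather than the authors' released code, and the DACMs are left as identity placeholders (a DACM sketch follows in Section 2.2).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class DecoderBlock(nn.Module):
    """Decoder block (DB): two conv layers, one transpose convolution,
    three batch normalizations and three ReLUs; doubles the resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(out_ch, out_ch, 2, stride=2, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class MSFENetSketch(nn.Module):
    def __init__(self, n_bands=4):                    # GF-2 has 4 bands
        super().__init__()
        r = resnet34(weights=None)                    # encoder backbone
        self.stem = nn.Sequential(                    # 7x7, stride-2 conv
            nn.Conv2d(n_bands, 64, 7, 2, 3, bias=False), r.bn1, r.relu)
        self.pool = r.maxpool
        self.rb = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        # Four DACMs on the skip connections; identity placeholders here.
        self.dacm = nn.ModuleList([nn.Identity() for _ in range(4)])
        self.db = nn.ModuleList([                     # five decoder blocks
            DecoderBlock(512, 256), DecoderBlock(256 + 256, 128),
            DecoderBlock(128 + 128, 64), DecoderBlock(64 + 64, 64),
            DecoderBlock(64, 32)])
        self.head = nn.Conv2d(32, 1, 1)               # water probability map

    def forward(self, x):
        x0 = self.stem(x)                             # 1/2 resolution
        x1 = self.rb[0](self.pool(x0))                # 1/4, 64 channels
        x2 = self.rb[1](x1)                           # 1/8, 128 channels
        x3 = self.rb[2](x2)                           # 1/16, 256 channels
        x4 = self.rb[3](x3)                           # 1/32, 512 channels
        d = self.db[0](self.dacm[3](x4))              # back to 1/16
        d = self.db[1](torch.cat([d, self.dacm[2](x3)], 1))  # 1/8
        d = self.db[2](torch.cat([d, self.dacm[1](x2)], 1))  # 1/4
        d = self.db[3](torch.cat([d, self.dacm[0](x1)], 1))  # 1/2
        d = self.db[4](d)                             # full resolution
        return torch.sigmoid(self.head(d))

probs = MSFENetSketch()(torch.randn(1, 4, 512, 512))  # (1, 1, 512, 512)
```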

Figure 1. The architecture of multi-scale features extraction network (MSFENet), including four residual blocks (RB) which are composed of several basic blocks (BB), four dense atrous convolution modules (DACM), and five decoder blocks (DB).


MSFENet can be trained with a cross-entropy loss, which makes the output of MSFENet close to the ground truth. Considering an image with $N$ pixels, the output $p_i$ of MSFENet represents the probability of the $i$th pixel belonging to water, and $y_i$ indicates whether the $i$th pixel is water. The cross-entropy loss can then be formulated as follows:

(1) $L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$
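For reference, Eq. (1) is the standard binary cross-entropy, sketched below in PyTorch; the `eps` clamp is our addition for numerical safety.

```python
import torch

def cross_entropy_loss(p, y, eps=1e-7):
    """Eq. (1): p is the predicted per-pixel water probability and y the
    binary ground truth, both of shape (B, 1, H, W)."""
    p = p.clamp(eps, 1 - eps)  # avoid log(0)
    return -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
```

In practice `torch.nn.functional.binary_cross_entropy(p, y)` computes the same quantity.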

2.2 Multi-scale feature extractor

To avoid the information loss caused by enlarging the receptive field through pooling or ordinary convolution, atrous convolution is used for multi-scale feature extraction in the MSFE. As a special kind of convolution, atrous convolution has a dilation rate parameter in addition to the parameters of ordinary convolution. Figure 2 shows atrous convolutions with different dilation rates. An atrous convolution with a dilation rate of $d$ can be regarded as an ordinary convolution with $d-1$ zeros inserted between each row and each column of the convolution kernel. The receptive field $r_n$ of the $n$th layer of a CNN is calculated as follows:

(2) $r_n = r_{n-1} + (k_n - 1)\prod_{i=1}^{n-1} s_i$

Figure 2. Atrous convolution kernels with dilation rate of 1 (a), 3 (b) and 5 (c).


where $k_n$ denotes the kernel size of the $n$th layer of the CNN, and $s_i$ denotes the stride of the $i$th layer. Therefore, atrous convolution can enlarge the receptive field of a CNN by increasing the dilation rate without increasing the number of parameters.
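As a quick check of Eq. (2), the helper below propagates the receptive field through a stack of layers, using the standard effective kernel size $k + (k-1)(d-1)$ of an atrous convolution with dilation rate $d$:

```python
def receptive_field(layers):
    """Apply Eq. (2) layer by layer. `layers` is a list of
    (kernel_size, stride, dilation) tuples, from input to output."""
    r, jump = 1, 1                        # receptive field, stride product
    for k, s, d in layers:
        k_eff = k + (k - 1) * (d - 1)     # atrous kernel covers a wider span
        r += (k_eff - 1) * jump
        jump *= s
    return r

# Three stacked 3x3, stride-1 convolutions with dilation 1, 3, 5:
print(receptive_field([(3, 1, 1), (3, 1, 3), (3, 1, 5)]))  # -> 19
```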

Each DACM is formed by parallel atrous convolutions with different dilation rates to extract multi-scale features under different receptive fields (Figure 3). There are three kinds of atrous convolutions in the DACM, with dilation rates of 1, 3, and 5, respectively. A 1×1 convolution with rectified linear activation is adopted in each atrous branch, and the features of all branches are combined with the original features to obtain the output of the DACM. A branch with a large receptive field excels at extracting features of large objects, while a branch with a small receptive field extracts features of small objects. Through the parallel connection of different branches, the DACM can extract multi-scale features for objects of different sizes. The MSFE consists of four SCs and four DACMs in total, with each DACM extracting features from a corresponding SC. Thus, multi-layer, multi-scale features can be extracted and used in the decoder for the final water extraction.
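A minimal PyTorch sketch of such a module is given below; whether the branch outputs are summed or concatenated with the original features is not fully specified by the text, so summation is assumed here to keep the channel count unchanged:

```python
import torch
import torch.nn as nn

class DACM(nn.Module):
    """Sketch of a dense atrous convolution module: three parallel 3x3
    atrous branches (dilation 1, 3, 5), each followed by a 1x1 convolution
    and ReLU; branch outputs are summed with the input features."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                      bias=False),                 # padding=d keeps the size
            nn.Conv2d(channels, channels, 1),
            nn.ReLU(inplace=True)) for d in (1, 3, 5))

    def forward(self, x):
        out = x                                    # keep original features
        for branch in self.branches:
            out = out + branch(x)                  # fuse multi-scale responses
        return out
```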

Figure 3. The structure of dense atrous convolution module (DACM).


2.3 Pixel-wise supervised contrastive learning

The key to CL is the selection of positive and negative sample pairs. In most cases, CL is unsupervised: different images constitute negative sample pairs, while an image and its distorted version constitute a positive sample pair. In this work, pixel-wise CL is performed because water extraction is a semantic segmentation task. Pixel-wise CL requires pixel-level positive and negative sample pairs, which cannot be obtained by data augmentation alone. Therefore, labeled images are used to select the sample pairs: pixels in two images with the same label form positive sample pairs, while pixels with different labels form negative sample pairs, as shown in Figure 4.

Figure 4. Illustration of pixel-level positive and negative sample pairs. Different colours represent different categories. Ac, Ad, Ba, Bb, Cc, Cd, Dc, and Dd are positive sample pairs. Aa, Ab, Bc, Bd, Ca, Cb, Da, and Db are negative sample pairs.


The purpose of CL is to make the features learned from positive samples similar and the features learned from negative samples different, which is generally achieved by a contrastive loss. Let $I$ denote an image, $I'$ its distorted version, $N_I$ the number of pixels in image $I$, and $N_{I'}$ the number of pixels in image $I'$. The contrastive loss can be formulated as follows:

(3) $L_{cl} = -\frac{1}{\sum_{k=1}^{N_I} N_k^+}\sum_{k=1}^{N_I}\sum_{i=1}^{N_k^+}\log\frac{e^{z_k \cdot z_i/\tau}}{e^{z_k \cdot z_i/\tau} + \sum_{j=1}^{N_k^-} e^{z_k \cdot z_j/\tau}/N_k^-}$

where $N_k^+$ denotes the number of positive samples of pixel $k$, $N_k^-$ denotes the number of negative samples of pixel $k$, $N_{I'} = N_k^+ + N_k^-$, $z$ denotes the normalized deep features (the output of the fourth DB in this work), $\cdot$ denotes the dot product of two vectors, and $\tau > 0$ is a temperature hyper-parameter. The final loss function of MSFENet is therefore:

(4) $L = L_{ce} + L_{cl}$
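The following is a hedged PyTorch sketch of the pixel-wise contrastive loss of Eq. (3) for two batches of pixel features; the temperature value and the pixel subsampling strategy are our assumptions, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(z, y, z_d, y_d, tau=0.1):
    """Eq. (3): z (N, C) and z_d (M, C) are pixel features from an image
    and its distorted version (subsampled in practice); y, y_d are the
    corresponding binary labels; tau is an assumed temperature."""
    z, z_d = F.normalize(z, dim=1), F.normalize(z_d, dim=1)
    sim = torch.exp(z @ z_d.t() / tau)              # (N, M) similarities
    pos = (y[:, None] == y_d[None, :]).float()      # positive-pair mask
    neg = 1.0 - pos
    n_neg = neg.sum(dim=1, keepdim=True).clamp(min=1)
    # Denominator of Eq. (3): the positive similarity plus the averaged
    # sum over that anchor's negative similarities.
    neg_term = (sim * neg).sum(dim=1, keepdim=True) / n_neg
    log_prob = torch.log(sim / (sim + neg_term))    # used where pos == 1
    return -(log_prob * pos).sum() / pos.sum().clamp(min=1)
```

In practice, $z$ and $z_d$ would be gathered from the output of the fourth DB for pixels of $I$ and $I'$, typically after random subsampling to keep the $N \times M$ similarity matrix tractable.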

2.4 Accuracy assessment

Four evaluation metrics are used for accuracy assessment of water body extraction: precision (P), recall (R), F1-score (F1), and kappa coefficient (K). These four evaluation metrics are computed as follows:

(5) $P = \frac{TP}{TP + FP} \times 100\%$
(6) $R = \frac{TP}{TP + FN} \times 100\%$
(7) $F1 = \frac{2 \times P \times R}{P + R} \times 100\%$
(8) $K = \frac{N \cdot C - (R_{C1} + R_{C2})}{N^2 - (R_{C1} + R_{C2})} \times 100\%$
(9) $C = TP + TN$
(10) $N = TP + TN + FN + FP$
(11) $R_{C1} = (TP + FN) \times (TP + FP)$
(12) $R_{C2} = (TN + FP) \times (TN + FN)$

where TP (true positive) denotes the number of correctly extracted water pixels, FP (false positive) the number of pixels incorrectly extracted as water, TN (true negative) the number of correctly identified non-water pixels, and FN (false negative) the number of water pixels incorrectly identified as non-water. The larger the values of these four metrics, the better the water extraction results.
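Given the pixel-level confusion matrix, Eqs. (5)–(12) reduce to a few lines of Python:

```python
def water_metrics(tp, fp, tn, fn):
    """Compute P, R, F1 and kappa (Eqs. (5)-(12)) from confusion-matrix
    counts; values are returned as fractions (multiply by 100 for %)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    n = tp + tn + fn + fp                 # Eq. (10)
    c = tp + tn                           # Eq. (9)
    rc1 = (tp + fn) * (tp + fp)           # Eq. (11)
    rc2 = (tn + fp) * (tn + fn)           # Eq. (12)
    kappa = (n * c - (rc1 + rc2)) / (n ** 2 - (rc1 + rc2))
    return p, r, f1, kappa
```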

3 Dataset

3.1 Data introduction

The Gaofen Image Dataset (GID) is a large-scale land-cover dataset produced from GF-2 satellite images. GID consists of two parts: a fine land-cover classification set (FLCCS) and a large-scale classification set (LSCS) (Tong et al. Citation2020). There are fifteen land-cover categories in the FLCCS and five in the LSCS. Water is a separate category in the LSCS, while it is divided into several subcategories in the FLCCS, so the LSCS is utilized for water extraction for convenience. The LSCS contains 150 GF-2 images distributed over nearly 60 cities in China, and each GF-2 image has a corresponding pixel-level labeled image that indicates the category of each pixel. Although water is included in the LSCS, its labels are not accurate. Therefore, four GF-2 images were selected from the LSCS for the experiments in this article, and the water bodies in these images were relabeled by careful visual interpretation (Figure 5). The spatial resolution and size of each GF-2 image are 3.2 m and 6800×7200 pixels, respectively. GF-2 images include four bands (i.e. NIR, red, green, and blue). The study areas contain different water bodies, including ponds, lakes, and rivers of different sizes and characteristics. In addition, low-albedo objects such as shadows are widespread in the study areas and can easily be confused with water bodies. Therefore, the datasets used in this article are well suited for evaluating water extraction models.

Figure 5. Four selected GF-2 images and the corresponding labelled images. Row A shows standard false color GF-2 images, Row B shows labelled images in the LSCS, and Row C shows images relabelled by visual interpretation.


3.2 Data preprocessing

The four GF-2 images and corresponding relabeled images are divided into training, validation and test sets: images (a) and (b) in Figure 5 form the training set, (c) is the validation set, and (d) is the test set. The training set is used to train the learnable parameters of the models, the validation set to tune hyper-parameters such as the learning rate, and the test set to evaluate model accuracy. Before model training, the images and their corresponding relabeled images in the training and validation sets are cropped into patches of size 512×512 with a cropping interval of 256 pixels. In total, 1566 image patches and corresponding label patches are obtained for model training, and 783 for model validation. To make the model converge more easily, all image patches are divided by 255 before being fed into the model.
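A sketch of this preprocessing, assuming channel-first NumPy arrays; the handling of leftover border pixels is not specified in the text and is omitted here:

```python
import numpy as np

def crop_patches(image, label, size=512, stride=256):
    """Crop an image (C, H, W) and its label (H, W) into size x size
    patches at the stated 256-pixel interval, scaling pixel values to
    [0, 1] by dividing by 255."""
    patches = []
    _, h, w = image.shape
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            img = image[:, top:top + size, left:left + size] / 255.0
            lab = label[top:top + size, left:left + size]
            patches.append((img.astype(np.float32), lab))
    return patches
```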

In the accuracy assessment phase, when large water bodies are well extracted, the evaluation of small water bodies will be biased because of imbalanced samples. To better evaluate water extraction accuracy, two regions in the test set that mainly contain small water bodies are selected to evaluate the results of different models on small water bodies (Figure 6).

Figure 6. Two regions mainly containing small water bodies, (a), (b), and (c) are GF-2 image, labelled image in LSCS, and relabelled image, respectively.


In the field of object detection, targets smaller than 32×32 pixels are usually defined as small targets. Therefore, water bodies with fewer than 1000 pixels are regarded as small water bodies in this paper. The area distribution histogram and cumulative area distribution histogram of the water bodies in regions 1 and 2 are shown in Figure 7. The proportions of small water bodies in these two regions are 97.57% and 88.69%, respectively, making them suitable for evaluating a model's ability to extract small water bodies. Taking the water label image in the LSCS as the water extraction result of a hypothetical model and the relabelled image as the ground truth, the four metrics are shown in Table 1. For the entire test set, although only a few large water bodies are extracted, the kappa coefficient is still close to 70%, and the F1-score even exceeds 70%; this clearly does not reflect the true quality of the water extraction. For region 1, since no water bodies are extracted, precision and F1-score cannot be calculated, and recall and kappa coefficient are zero. For region 2, the F1-score and kappa coefficient are 18.51% and 17.28%, respectively. The evaluation results of these two local regions better reflect the actual accuracy of the water extraction results.
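The proportion of small water bodies can be computed by connected-component analysis of the binary water mask, e.g. with SciPy (4-connectivity, SciPy's default, is assumed here):

```python
import numpy as np
from scipy import ndimage

def small_water_ratio(water_mask, max_pixels=1000):
    """Proportion of water bodies (connected components of the binary
    mask) smaller than max_pixels, the small-object threshold used in
    this paper."""
    labeled, n = ndimage.label(water_mask)
    if n == 0:
        return 0.0
    sizes = np.bincount(labeled.ravel())[1:]  # pixels per water body
    return float(np.mean(sizes < max_pixels))
```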

Figure 7. The water area distribution histogram of region 1 (a), region 2 (b), and cumulative water area distribution histogram of region 1 (c), region 2 (d).


Table 1. Accuracy assessment results of water label image in LSCS.

4 Experiments and analysis

4.1 Implementation details

All code is implemented in Python and C++. The PyTorch deep learning framework in Python is used to implement MSFENet and the other deep learning methods. The GDAL library in C++ is employed to clip training samples in the training phase, clip validation samples in the validation phase, clip image patches in the prediction phase, and merge the water prediction results. An i7-12700F central processing unit and a 12 GB NVIDIA GeForce RTX 3060 graphics processing unit are used to run the experiments.

During the training phase, distorted images for CL are obtained by contrast enhancement. The optimizer, initial learning rate, maximum number of epochs and training batch size used in all experiments are Adam, 0.0002, 50, and 4, respectively. To make MSFENet converge faster, model parameters pre-trained on ImageNet are used to initialize the encoder, and the learning rate is halved once the F1-score of the validation set stops increasing for 5 consecutive epochs. In addition, early stopping helps to prevent MSFENet from overfitting; the training process is therefore terminated once the F1-score of the validation set stops improving for 10 consecutive epochs.
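This schedule maps directly onto PyTorch's `ReduceLROnPlateau` plus a manual early-stopping counter; the sketch below uses stand-in stubs for the actual model, training and validation steps:

```python
import torch

def train_one_epoch(model, optimizer):
    """Stub standing in for one pass over the training patches."""

def validation_f1(model):
    """Stub standing in for F1-score evaluation on the validation set."""
    return 0.0

model = torch.nn.Conv2d(4, 1, 1)                    # stand-in for MSFENet
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=5)  # halve LR on plateau

best_f1, stale = 0.0, 0
for epoch in range(50):                             # at most 50 epochs
    train_one_epoch(model, optimizer)
    f1 = validation_f1(model)
    scheduler.step(f1)
    if f1 > best_f1:
        best_f1, stale = f1, 0
    else:
        stale += 1
        if stale >= 10:                             # early stopping
            break
```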

During the prediction phase, only the original image patches are fed into MSFENet, whereas the training phase also requires the distorted image patches as input. Additionally, due to limited video memory, the remote sensing image in the test set must be cropped into image patches for prediction. The most commonly used method is to crop a remote sensing image into non-overlapping patches; however, this may produce stitching lines when the results are merged. To solve this problem, an overlapping cropping method is used in the prediction of water bodies, as shown in Figure 8. Each 512×512 image patch (red boxes in Figure 8) obtained by cropping overlaps each adjacent patch by 512×128 pixels. The white part inside the red boxes in Figure 8, i.e. the part beyond the image extent, is padded with zeros during cropping. All image patches are fed into the trained deep learning models to obtain prediction results. Only the 384×384 pixels (yellow boxes) in the center of each prediction patch are considered valid, and the final water extraction result is obtained by merging all the 384×384 prediction patches.
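A sketch of this overlapping-crop prediction, assuming the model maps a (1, C, 512, 512) tensor to a (1, 1, 512, 512) probability map:

```python
import numpy as np
import torch

def predict_sliding(model, image, patch=512, valid=384):
    """Overlapping-crop prediction: 512x512 windows step every 384 pixels,
    out-of-image areas are zero-padded, and only the central 384x384 of
    each predicted patch is written into the output mosaic."""
    margin = (patch - valid) // 2                     # 64-pixel border
    c, h, w = image.shape
    out = np.zeros((h, w), dtype=np.float32)
    pad = np.zeros((c, h + 2 * margin + patch, w + 2 * margin + patch),
                   dtype=np.float32)
    pad[:, margin:margin + h, margin:margin + w] = image
    model.eval()
    for top in range(0, h, valid):
        for left in range(0, w, valid):
            tile = np.ascontiguousarray(
                pad[:, top:top + patch, left:left + patch])
            with torch.no_grad():
                prob = model(torch.from_numpy(tile)[None])[0, 0].numpy()
            core = prob[margin:margin + valid, margin:margin + valid]
            oh, ow = min(valid, h - top), min(valid, w - left)
            out[top:top + oh, left:left + ow] = core[:oh, :ow]
    return out
```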

Figure 8. Water bodies prediction process using deep learning models. The yellow boxes represent valid areas after cropping, and the red boxes represent all areas after cropping.


4.2 Ablation study

Ablation experiments with different settings are conducted to verify the influences of each component. The three components, SC, DACM, and CL, are analyzed in detail. The encoder-decoder architecture, i.e. the MSFENet without MSFE proposed in this work, is used as baseline for a fair comparison. Firstly, the influences of multi-layer features obtained by SC on experimental results and small water bodies are discussed in this section. Subsequently, the effects of multi-scale features obtained by DACM on experimental results and small water bodies are discussed. Finally, the impacts of CL on experimental results are discussed.

Table 2 shows the accuracy assessment results of the four configurations on the test set: the baseline, the baseline with SC, MSFENet, and MSFENet with CL. MSFENet achieves the highest F1-score and kappa coefficient, which are 1.1% and 1.28% higher than those of the baseline, respectively. The baseline with SC also achieves good results, 0.71% and 0.82% higher than the baseline. In addition, compared with the baseline, MSFENet brings a 2.66% increase in recall at the expense of 0.62% in precision. Higher recall means fewer water bodies are missed; thus, MSFENet largely reduces the omission rate of water bodies. However, the accuracy assessment of MSFENet with CL shows that CL cannot further increase the accuracy of water extraction on top of MSFENet; this phenomenon is discussed separately in detail in the next section.

Table 2. Accuracy assessment results of ablation study on test set (bold numbers refer to the highest values).

Although the accuracy assessment results on the test set illustrate that the MSFE can effectively improve the accuracy of water extraction, they cannot fully reflect its effectiveness on small water bodies. Tables 3 and 4 show the accuracy assessment results of different network configurations on regions 1 and 2. In region 1, MSFENet achieves improvements of 1.83% and 1.88% over the baseline in F1-score and kappa coefficient, respectively. In region 2, the F1-score and kappa coefficient of MSFENet are 2.31% and 2.18% higher than those of the baseline, respectively. Compared with the baseline, MSFENet achieves 4.61% and 5.23% improvements in recall in regions 1 and 2, which are 1.95% and 2.67% larger than the improvement on the test set. These results show that the MSFE can effectively extract multi-scale features and reduce the omission rate. In addition, within the MSFE, SC contributes more than DACM to the extraction accuracy of small water bodies. CL is again not effective in improving the accuracy of water extraction in regions 1 and 2.

Table 3. Accuracy assessment results of ablation study on region 1 (bold numbers refer to the highest values).

Table 4. Accuracy assessment results of ablation study on region 2 (bold numbers refer to the highest values).

The results of different network configurations are shown in Figure 9. Multi-layer features are extracted in the baseline by the initial convolution layer and four RBs, but only the most abstract features, extracted by the fourth RB, are used by the decoder for water extraction. Since the baseline can only use these abstract features, much information about water bodies is lost: small water bodies are missed, and the boundary information of large water bodies is not accurate enough (Figure 9). Fortunately, SC enables the decoder to make full use of the multi-layer features extracted by the encoder. For small water body extraction and accurate boundary recovery, the low-layer features, which undergo fewer pooling operations, are more effective than the deep-layer features. As shown in Figure 9, the baseline with SC detects many small water bodies that are missed by the baseline, and the boundaries of other water bodies are more accurate. However, although adding SC to the baseline effectively reduces the omission rate, it also increases the false alarm rate, which is evident in Figure 9. The accuracy assessment results in Tables 2, 3, and 4 also show a significant decrease in precision after adding SC to the baseline, which represents an increase in the false alarm rate. After adding DACM to the baseline and forming the MSFE together with SC, the false alarm rate stops rising and begins to fall again. The MSFE extracts both multi-layer features and multi-scale features in each layer, which further reduces the omission rate, makes the boundaries of water bodies more accurate, and thus suppresses the increase in false alarm rate. As shown in Figure 9, the falsely detected water bodies no longer appear. The accuracy assessment results in Tables 2, 3, and 4 also show that MSFENet achieves a significant improvement in precision over the baseline with SC.

Figure 9. Water extraction results with different network configurations. (a) Image, (b) Ground truth, (c) the Baseline, (d) the Baseline + SC, (e) the MSFENet, and (f) the MSFENet + CL.


4.3 The effectiveness of contrastive learning

In the ablation study, CL cannot further improve the water extraction accuracy when all training samples are used. The reason is that the amount of training data is much larger than that of the test set; MSFENet can already extract effective features when sufficient training samples are available, so CL brings no further improvement. However, when the number of training samples is insufficient, MSFENet cannot extract effective features from the training samples alone, and adding CL can effectively increase the representation ability of the features, thus improving water extraction accuracy.

To verify the ability of CL, nine experiments are conducted using 1%, 5%, 10%, 20%, 40%, 60%, 80%, and 100% of the training samples to explore whether CL can increase the water extraction capability of MSFENet with small sample sizes. Figure 10 shows the accuracy curves as the sample size increases. As the sample size grows, the accuracy improvement brought by MSFENet + CL becomes smaller and smaller. As can be seen from Figure 10, the greatest accuracy improvement from CL is achieved when only 1% of the training samples are used to train MSFENet, as shown by the accuracy assessment results in Table 5. The F1-score and kappa coefficient on the test set improve by 2.67% and 3.04%, showing that CL is still very effective when the sample size is small. An interesting finding is that CL is more helpful for extracting small water bodies when the sample size is small, leading to improvements of 22.54% and 13.31% in F1-score and kappa coefficient in region 1, and 4.72% and 4.96% in region 2. Therefore, CL is crucial for water extraction when samples are clearly insufficient for a large-scale mission.

Figure 10. Accuracy curves with the increasing of sample sizes. (a) test set, (b) region 1, and (c) region 2.


Table 5. Accuracy assessment results when 1% of the training samples are used (bold numbers refer to the highest values).

4.4 Comparison with other methods

The proposed MSFENet is compared with NDWI, RF, FCN (Shelhamer, Long, and Darrell Citation2017), PSPNet (Zhao et al. Citation2017), UNet (Ronneberger, Fischer, and Brox Citation2015), and DeepLabv3+ (Chen et al. Citation2018) on the four selected GF-2 images in GID. The NDWI method classifies pixels with NDWI greater than 0.3 as water. For RF, the images are first segmented into image objects, and object-based features are then extracted to train the classifier. The other four deep learning methods are implemented with MMSegmentation (MMSC Citation2020), and the backbone of FCN, PSPNet, and DeepLabv3+ is ResNet34. All methods except NDWI are trained on the training set, and all methods are evaluated on the test set.
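For reference, the NDWI baseline reduces to a band ratio and a fixed threshold; a sketch assuming NumPy arrays for the green and NIR bands:

```python
import numpy as np

def ndwi_water_mask(green, nir, threshold=0.3):
    """McFeeters' NDWI = (green - NIR) / (green + NIR); pixels whose
    index exceeds the 0.3 threshold used here are classified as water."""
    green, nir = green.astype(np.float64), nir.astype(np.float64)
    ndwi = (green - nir) / (green + nir + 1e-7)  # avoid division by zero
    return ndwi > threshold
```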

The values of the four metrics are shown in Table 6. Compared with NDWI, RF, FCN, PSPNet, UNet, and DeepLabv3+, the F1-score of MSFENet achieves significant improvements of 2.38%, 3.62%, 3.36%, 3.33%, 1.53%, and 2.88%, respectively, and the kappa coefficient improvements of 2.96%, 4.2%, 3.89%, 3.86%, 1.77%, and 3.32%, respectively. UNet has the second highest F1-score and kappa coefficient, because it also has multiple SCs that can extract multi-layer features of water bodies. The effectiveness of SC for water extraction was verified in the ablation study, but the accuracy of UNet still does not reach the level of the proposed baseline. NDWI achieves the third highest accuracy, mainly because a suitable threshold was selected: a total of 11 thresholds from 0 to 0.5 at 0.05 intervals were tested, and a threshold of 0.3 was finally chosen as it achieved the highest accuracy. In practical water extraction, NDWI cannot achieve such high accuracy, because a suitable threshold can hardly be selected for a large-scale mission. RF achieves the lowest accuracy, because the features it uses are manually defined and selected and cannot meet the requirements of extracting different types of water bodies. The remaining three methods, FCN, PSPNet, and DeepLabv3+, achieve similar accuracy.

Table 6. Accuracy assessment results of different methods on test set (bold numbers refer to the highest values).

Similarly, to compare the extraction accuracy of these methods on small water bodies, the four evaluation metrics calculated on regions 1 and 2 are shown in Tables 7 and 8. MSFENet still achieves the highest F1-score and kappa coefficient in regions 1 and 2, indicating that it achieves the best extraction results for small water bodies. Thanks to the multi-layer feature extraction capability of SC, the F1-score and kappa coefficient of UNet are only 4.28% and 4.4% lower than those of MSFENet in region 1, and only 1.49% and 1.57% lower in region 2. DeepLabv3+, with the multi-scale feature extraction capability of ASPP, also achieves a high F1-score and kappa coefficient in regions 1 and 2. In contrast, the NDWI method achieves the lowest accuracy in region 1, and its accuracy in region 2 is also not high, indicating its poor ability to extract small water bodies. The remaining three methods, RF, FCN, and PSPNet, are also much less capable of extracting small water bodies than MSFENet. Figure 11 shows the results of the different methods and confirms these findings: MSFENet has a stronger extraction capability for small water bodies and obtains more accurate water boundaries.

Figure 11. Water extraction results with different methods. (a) Image, (b) Ground truth, (c) NDWI, (d) RF, (e) FCN, (f) PSPNet, (g) UNet, (h) DeepLabv3+, and (i) MSFENet.


Table 7. Accuracy assessment results with different methods in region 1 (bold numbers refer to the highest values).

Table 8. Accuracy assessment results with different methods in region 2 (bold numbers refer to the highest values).

4.5 Spectral variability analysis

Spectral variation within the same category of ground objects brings great challenges to water extraction from remote sensing images. To check the stability of the proposed MSFENet under spectral variation, a simulated image is generated by replacing the second band of the GF-2 image in the test set with its third band (Figure 12). MSFENet is trained on the training set, tested on the simulated image, and compared with four deep learning models: FCN, PSPNet, UNet, and DeepLabv3+.

Figure 12. The GF-2 image in the test set (a) and simulated image (b).


Table 9 shows the values of the four evaluation metrics of the different methods on the simulated image. After the spectral variation, the F1-score and kappa coefficient of all methods decrease considerably, but MSFENet still achieves the highest values. This is mainly because the MSFE in MSFENet can extract multi-layer, multi-scale features of water bodies: the spectral features change in this analysis, but the shapes of the water bodies do not, and the multi-layer, multi-scale features can capture these shape characteristics. DeepLabv3+ and PSPNet, two networks that extract multi-scale features using ASPP and PPM respectively, also achieve high accuracy, whereas FCN, which extracts only single-scale features, achieves lower accuracy. UNet has the lowest accuracy because its encoder is not ResNet34, with its strong feature extraction ability, but a series of plain convolutional layers that cannot effectively cope with spectral variations. When the whole simulated image is evaluated, MSFENet does not show a significant advantage in accuracy. But when the two regions dominated by small water bodies are evaluated, MSFENet not only achieves the highest accuracy but also improves significantly on the second-best method. As shown in Tables 10 and 11, the F1-score and kappa coefficient of MSFENet are 14.28% and 14.58% higher than the second highest values in region 1, and 20.18% and 20.9% higher in region 2. This indicates that MSFENet is more stable than the other methods and can extract more robust features for small water bodies. The water extraction results of the different methods on the simulated image are shown in Figure 13; MSFENet obviously obtains better results than the other methods, consistent with the accuracy assessment.

Figure 13. Water extraction results of different methods on the simulated image. (a) Image, (b) Ground truth, (c) FCN, (d) PSPNet, (e) UNet, (f) DeepLabv3+ and (g) MSFENet.


Table 9. Accuracy assessment results of different methods on the simulated image (bold numbers refer to the highest values).

Table 10. Accuracy assessment results of different methods on the region 1 of the simulated image (bold numbers refer to the highest values).

Table 11. Accuracy assessment results of different methods on the region 2 of the simulated image (bold numbers refer to the highest values).

4.6 Validation on LoveDA

To further validate the effectiveness of MSFENet, experiments are performed to compare the water extraction accuracy of MSFENet, FCN, PSPNet, UNet and DeepLabv3+ on LoveDA (Wang et al. Citation2021). LoveDA contains 5987 high-resolution images of size 1024×1024, including 2522 images in the training set, 1669 in the validation set and 1796 in the test set. The spatial resolution of these images is 0.3 m, and all images, with red, green and blue bands, are obtained from the Google Earth platform. The images in LoveDA cover both urban and rural areas in Nanjing, Changzhou, and Wuhan, China. Seven land-cover types are included: background, buildings, roads, water, barren, forests, and agriculture. Since ground truth is not provided for the test set of LoveDA, the validation set is randomly divided into two equal parts, one serving as the new validation set and the other as the new test set. All images are cropped into 512×512 patches, yielding 10,088 images for training, 3336 for validation and 3340 for testing. The experimental settings are the same as those of the experiments on GID.

Table 12 shows the values of the four evaluation metrics of the different methods on LoveDA. All five methods achieve a low water extraction accuracy, because the images in LoveDA contain only three bands (R, G, and B), unlike the GID images, which also have a NIR band. Another issue is that the labels in GID are more accurate than those in LoveDA, as all GID labels were produced manually; mislabeling in LoveDA therefore lowers the accuracy of all models. Moreover, the spatial resolution of the LoveDA images is 0.3 m, finer than the 3 m of the GID images; finer spatial resolution increases landscape heterogeneity and makes object identification more difficult. Although all methods achieve low accuracy, MSFENet still outperforms the other methods on LoveDA, with improvements of 1.35% in F1-score and 1.46% in kappa coefficient over DeepLabv3+. UNet produces the lowest accuracy because its backbone is not ResNet34, so no pre-trained parameters can be loaded in the training step. Figure 14 shows the results of the different methods on LoveDA, where MSFENet achieves the best results in terms of both boundaries and accuracy.

Figure 14. Water extraction results of different methods in LoveDA. (a) Image, (b) Ground truth, (c) FCN, (d) PSPNet, (e) UNet, (f) DeepLabv3+, and (g) MSFENet.


Table 12. Accuracy assessment results of different methods in LoveDA (bold numbers refer to the highest values).

5 Conclusions

This paper proposed a multi-scale feature extraction network (MSFENet) with CL for water extraction from optical high-resolution remote sensing imagery. In MSFENet, an MSFE is designed to extract multi-layer, multi-scale features of water bodies, and these features are combined to obtain the final water extraction results. Compared with other methods, MSFENet achieves the best performance on both GID and LoveDA. Moreover, the recall, F1-score, and kappa coefficient are improved by 9.49%, 4.28%, and 4.4% in region 1, and by 4.93%, 1.49% and 1.12% in region 2, compared with UNet, which achieves the second highest F1-score and kappa coefficient on the test set. These results indicate that MSFENet can effectively improve the extraction accuracy of small water bodies. In addition, the results of the spectral variability analysis show that MSFENet is more stable than the other neural networks, especially for small water body extraction: the recall, F1-score, and kappa coefficient are improved by 21.61%, 14.28%, and 14.58% in region 1, and by 26.14%, 20.18%, and 20.9% in region 2, compared with PSPNet, which achieves the second highest F1-score and kappa coefficient in regions 1 and 2 of the simulated image. Furthermore, CL cannot further improve water extraction accuracy when the sample size is sufficient, but brings a large improvement when it is insufficient. In future work, the structure of MSFENet will be further optimized, and a larger test set will be used to test its water extraction ability and to explore the contribution of CL to the model.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The GF-2 images are freely available from the Gaofen Image Dataset (GID): https://x-ytong.github.io/project/GID.html. LoveDA is freely available at: https://github.com/Junjue-Wang/LoveDA. The relabeled images and code that support the findings of this study are available from the corresponding author upon reasonable request.

Additional information

Funding

This work was supported by the National Natural Science Foundation of China under Grant 41871372.

References

  • Bachman, P., R. D. Hjelm, and W. Buchwalter. 2019. “Learning Representations by Maximizing Mutual Information Across Views.” Paper Presented at the Conference on Neural Information Processing Systems, Vancouver, CANADA, December 08-14.
  • Bai, L. B., S. H. Du, X. Y. Zhang, H. Y. Wang, B. Liu, and S. Ouyang. 2022. “Domain Adaptation for Remote Sensing Image Semantic Segmentation: An Integrated Approach of Contrastive Learning and Adversarial Learning.” IEEE Transactions on Geoscience and Remote Sensing 60: 5628313. doi:10.1109/TGRS.2022.3198972.
  • Cao, R. L., C. J. Li, L. Y. Liu, J. H. Wang, and G. J. Yan. 2008. “Extracting Miyun Reservoir’s Water Area and Monitoring Its Change Based on a Revised Normalized Different Water Index.” Science of Surveying and Mapping 33 (2): 158–20. doi:10.3771/j.issn.1009-2307.2008.02.054.
  • Chaitanya, K., E. Erdil, N. Karani, and K. Ender. 2020. “Contrastive Learning of Global and Local Features for Medical Image Segmentation with Limited Annotation.” Paper Presented at the Conference on Neural Information Processing Systems, Vancouver, CANADA, December 02-08.
  • Chen, W. Q., J. L. Ding, Y. H. Li, and Z. Y. Niu. 2015. “Extraction of Water Information Based on China-Made GF-1 Remote Sensing Image.” Resources Science 37 (6): 1166–1172.
  • Chen, Y., R. S. Fan, X. C. Yang, J. X. Wang, and A. Latif. 2018. “Extraction of Urban Water Bodies from High-Resolution Remote-Sensing Imagery Using Deep Learning.” Water 10 (5): 585. doi:10.3390/w10050585.
  • Chen, Y. S., H. L. Jiang, C. Y. Li, X. P. Jia, and P. Ghamisi. 2016. “Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks.” IEEE Transactions on Geoscience and Remote Sensing 54 (10): 6232–6251. doi:10.1109/TGRS.2016.2584107.
  • Chen, L. C., G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. 2018. “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs.” IEEE Transactions on Pattern Analysis and Machine Learning 40 (4): 834–848. doi:10.1109/TPAMI.2017.2699184.
  • Chen, Y., L. L. Tang, Z. H. Kan, M. Bilal, and Q. Q. Li. 2020. “A Novel Water Body Extraction Neural Network (WBE-NN) for Optical High-Resolution Multispectral Imagery.” Journal of Hydrology 588: 125092. doi:10.1016/j.jhydrol.2020.125092.
  • Chen, Z., T. Zhang, and C. Ouyang. 2018. “End-To-End Airplane Detection Using Transfer Learning in Remote Sensing Images.” Remote Sensing 10 (1): 139. doi:10.3390/rs10010139.
  • Chen, L. C., Y. K. Zhu, G. Papandreou, F. Schroff, and H. Adam. 2018. “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation.” Paper Presented at the European Conference on Computer Vision, Munich, Germany, September 08-14. doi:10.1007/978-3-030-01234-2_49.
  • Cui, Q. L., M. Q. Wang, and Y. J. Huang. 2022. “Water Information Extraction in Shanghai by Integrating Random Forest Model and Six Water Indices.” Bulletin of Surveying and Mapping 2: 106–109. doi:10.13474/j.cnki.11-2246.2022.0052.
  • De Lima, R. P., and K. Marfurt. 2020. “Convolutional Neural Network for Remote-Sensing Scene Classification: Transfer Learning Analysis.” Remote Sensing 12 (1): 86. doi:10.3390/rs12010086.
  • Ding, F. 2009. “A New Method for Fast Information Extraction of Water Bodies Using Remotely Sensed Data.” Remote Sensing Technology and Application 24 (2): 167–171. doi:10.11873/j.issn.1004-0323.2009.2.167.
  • Du, N. R., H. Ottens, and R. Sliuzas. 2010. “Spatial Impact of Urban Expansion on Surface Water Bodies—a Case Study of Wuhan, China.” Landscape and Urban Planning 94 (3–4): 175–185. doi:10.1016/j.landurbplan.2009.10.002.
  • Feyisa, G. L., H. Meilby, R. Fensholt, and S. R. Proud. 2014. “Automated Water Extraction Index: A New Technique for Surface Water Mapping Using Landsat Imagery.” Remote sensing of environment 140: 23–35. doi:10.1016/j.rse.2013.08.029.
  • Fu, J., J. Z. Wang, and J. R. Li. 2008. “Study on the Automatic Extraction of Water Body from TM Image Using Decision Tree Algorithm.” Paper Presented at the International Symposium on Photoelectronic Detection and Imaging, Beijing: Peoples R. China, September 09-12. doi:10.1117/12.790602.
  • He, K. M., H. Q. Fan, Y. X. Wu, S. N. Xie, and R. Girshick. 2020. “Momentum Contrast for Unsupervised Visual Representation Learning.” Paper Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 14–19. doi:10.1109/CVPR42600.2020.00975.
  • He, K. M., X. Y. Zhang, S. Q. Ren, and J. Sun. 2015. “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition.” IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9): 1904–1916. doi:10.1109/TPAMI.2015.2389824.
  • Huang, G., Z. Liu, L. van der Maaten, and K. Q. Weinberger. 2017. “Densely Connected Convolutional Networks.” Paper Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, July 21-26. doi:10.1109/CVPR.2017.243.
  • Liu, Y. F., Y. F. Zhong, and Q. Q. Qin. 2018. “Scene Classification Based on Multiscale Convolutional Neural Network.” IEEE Transactions on Geoscience and Remote Sensing 56 (12): 7109–7121. doi:10.1109/TGRS.2018.2848473.
  • Li, J. J., C. Wang, L. Xu, F. Wu, H. Zhang, and B. Zhang. 2021. “Multitemporal Water Extraction of Dongting Lake and Poyang Lake Based on an Automatic Water Extraction and Dynamic Monitoring Framework.” Remote Sensing 13 (5): 865. doi:10.3390/rs13050865.
  • Li, M. Y., P. H. Wu, B. Wang, H. L. Park, H. Yang, and Y. L. Wu. 2021. “A Deep Learning Method of Water Body Extraction from High Resolution Remote Sensing Images with Multisensors.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14: 3120–3132. doi:10.1109/JSTARS.2021.3060769.
  • Lu, M., L. Y. Fang, M. X. Li, B. Zhang, Y. Zhang, and P. Ghamisi. 2022. “NFANet: A Novel Method for Weakly Supervised Water Extraction from High-Resolution Remote-Sensing Imagery.” IEEE Transactions on Geoscience and Remote Sensing 60: 5617114. doi:10.1109/TGRS.2022.3140323.
  • McFeeters, S. K. 1996. “The Use of the Normalized Difference Water Index (NDWI) in the Delineation of Open Water Features.” International Journal of Remote Sensing 17 (7): 1425–1432. doi:10.1080/01431169608948714.
  • MMSC (MMSegmentation Contributors). 2020. “OpenMmlab Semantic Segmentation Toolbox and Benchmark”. Available online: https://github.com/open-mmlab/mmsegmentation.
  • Nandi, I., P. K. Srivastava, and K. Shah. 2017. “Floodplain Mapping Through Support Vector Machine and Optical/Infrared Images from Landsat 8 OLI/TRIS Sensors: Case Study from Varanasi.” Water Resources Management 31 (4): 1157–1171. doi:10.1007/s11269-017-1568-y.
  • Nath, R. K., and S. K. Deb. 2010. “Water-Body Area Extraction from High Resolution Satellite Images-An Introduction, Review, and Comparison.” International Journal of Image Processing 3 (6): 353–372.
  • Pekel, J. F., A. Cottam, N. Gorelick, and A. S. Belward. 2016. “High-Resolution Mapping of Global Surface Water and Its Long-Term Changes.” Nature 540 (7633): 418. doi:10.1038/nature20584.
  • Ren, S. Q., K. M. He, R. Girshick, and J. Sun. 2017. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6): 1137–1149. doi:10.1109/TPAMI.2016.2577031.
  • Ronneberger, O., P. Fischer, and T. Brox. 2015. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” Paper Presented at the Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, October 05-09. doi:10.1007/978-3-319-24574-4_28.
  • Shelhamer, E., J. Long, and T. Darrell. 2017. “Fully Convolutional Networks for Semantic Segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4): 640–651. doi:10.1109/TPAMI.2016.2572683.
  • Shen, Z. F., L. G. Xia, J. L. Li, J. C. Luo, and X. D. Hu. 2013. “Automatic and High-Precision Extraction of Rivers from Remotely Sensed Images with Gaussian Normalized Water Index.” Journal of Image and Graphics 18 (4): 421–428. doi:10.11834/jig.20130409.
  • Su, L. F., Z. X. Li, F. Gao, and M. Yu. 2021. “A Review of Remote Sensing Image Water Extraction.” Remote Sensing for Land and Resources 33 (1): 9–19. doi:10.16251/j.cnki.1009-2307.2018.05.005.
  • Tong, X. Y., G. S. Xia, Q. K. Lu, H. F. Shen, S. Y. Li, and S. C. You. 2020. “Land-Cover Classification with High-Resolution Remote Sensing Images Using Transferable Deep Models.” Remote Sensing of Environment 237: 111322. doi:10.1016/j.rse.2019.111322.
  • Wang, B., Z. L. Chen, W. L, X. H. Yang, and Y. Zhou. 2022. “SADA-Net: A Shape Feature Optimization and Multiscale Context Information-Based Water Body Extraction Method for High-Resolution Remote Sensing Images.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 15: 1744–1759. doi:10.1109/JSTARS.2022.3146275.
  • Wang, J. J., Z. Zheng, A. L. Ma, X. Y. Lu, and Y. F. Zhong. 2021. “LoveDa: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation”. Paper presented at Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. doi:10.5281/zenodo.5706578.
  • Wu, Z. R., Y. J. Xiong, S. X. Yu, and D. H. Lin. 2018. “Unsupervised Feature Learning via Non-Parametric Instance Discrimination.” Paper Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, June 18–21. doi:10.1109/CVPR.2018.00393.
  • Xie, Z. D., Y. T. Lin, Z. Zhang, Y. Cao, T. H. Lin, and H. Hu. 2021. “Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning.” Paper Presented at the Conference on Computer Vision and Pattern Recognition, Electr, Network, June 19-25 Jun. doi:10.1109/CVPR46437.2021.01641.
  • Xu, H. Q. 2005. “A Study on Information Extraction of Water Body with the Modified Normalized Difference Water Index (MNDWI).” Journal of Remote Sensing 9 (5): 589–595.
  • Yan, P., Y. J. Zhang, and Y. Zhang. 2007. “A Study on Information Extraction of Water System in Semi-Arid Regions with the Enhanced Water Index (EWI) and GIS Based Noise Remove Techniques.” Remote Sensing Information 6: 62–67.
  • Zhao, W. Z., S. H. Du, Q. Wang, and W. J. Emery. 2017. “Contextually Guided Very-High-Resolution Imagery Classification with Semantic Segments.” Isprs Journal of Photogrammetry and Remote Sensing 132: 48–60. doi:10.1016/j.isprsjprs.2017.08.011.
  • Zhao, H. S., J. P. Shi, X. J. Qi, X. G. Wang, and J. Y. Jia. 2017. “Pyramid Scene Parsing Network.” Paper Presented at the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, July 21-26. doi:10.1109/CVPR.2017.660.
  • Zhao, X. Y., R. Vemulapalli, P. A. Mansfield, B. Q. Gong, B. Green, L. Shapira, and Y. Wu. 2021. “Contrastive Learning for Label Efficient Semantic Segmentation.” Paper Presented at the Conference on Computer Vision, Electr, Network, October 11-17. doi:10.1109/ICCV48922.2021.01045.
  • Zhou, J. X., J. Chen, X. H. Chen, X. L. Zhu, Y. A. Qiu, H. H. Song, Y. H. Rao, C. S. Zhang, X. Cao, and X. H. Cui. 2021. “Sensitivity of Six Typical Spatiotemporal Fusion Methods to Different Influential Factors: A Comparative Study for a Normalized Difference Vegetation Index Time Series Reconstruction.” Remote Sensing of Environment 252: 112130. doi:10.1016/j.rse.2020.112130.
  • Zhou, Y., G. L. Xie, S. X. Wang, F. Wang, and F. T. Wang. 2014. “Information Extraction of Thin Rivers Around Built-Up Lands with False NDWI.” Journal of Geo-Information Science 16 (1): 102–107.
  • Zhou, L. M., C. Zheng, H. X. Yan, X. Y. Zuo, Y. Liu, B. J. Qiao, and Y. Yang. 2022. “RepDarknet: A Multi-Branched Detector for Small-Target Detection in Remote Sensing Images.” ISPRS International Journal of Geo-Information 11 (3): 158. doi:10.3390/ijgi11030158.