Research Article

Cross-supervised learning for cloud detection

Article: 2147298 | Received 21 Jul 2022, Accepted 06 Nov 2022, Published online: 03 Jan 2023

ABSTRACT

We present a new learning paradigm, that is, cross-supervised learning, and explore its use for cloud detection. The cross-supervised learning paradigm is characterized by both supervised training and mutually supervised training, and is performed by two base networks. In addition to the individual supervised training on labeled data, the two base networks perform mutually supervised training on unlabeled data using the prediction results provided by each other. Specifically, we develop In-extensive Nets for implementing the base networks. The In-extensive Nets consist of two Intensive Nets and are trained using the cross-supervised learning paradigm. The Intensive Net leverages information from the labeled cloudy images using a focal attention guidance module (FAGM) and a regression block. The cross-supervised learning paradigm empowers the In-extensive Nets to learn from both labeled and unlabeled cloudy images, substantially reducing the number of labeled cloudy images (which require expensive manual annotation effort) needed for training. Experimental results verify that the In-extensive Nets perform well and have a clear advantage in situations where only a few labeled cloudy images are available for training. The implementation code for the proposed paradigm is available at https://gitee.com/kang_wu/in-extensive-nets.

1. Introduction

Optical remote sensing images from different sensors are commonly used in applications such as land cover change monitoring and atmospheric variable estimation (Yu et al. 2017; Li et al. 2022). Clouds, as natural phenomena in the atmosphere, frequently appear in remote sensing images; about 66% of the Earth is covered by clouds (Zhang et al. 2004). When covered by clouds, optical remote sensing images lose a vast amount of information about the Earth's surface (Löw et al. 2022; Niederheiser et al. 2021), and some images even become invalid data. Cloud detection reduces the useless effort of processing these invalid data and supports the efficient use of optical remote sensing images. Owing to the complexity of cloud morphology and the diversity of land cover (Zhu and Woodcock 2012), cloud detection is usually a challenging task. In the remainder of this section, we review existing cloud detection methods, address their limitations, present our strategies, and explain our contributions.

1.1. Review of cloud detection methods

In this subsection, we review three families of cloud detection methods: threshold-based methods, handcraft feature-based methods, and deep learning-based methods.

The threshold-based methods set thresholds according to the spectral information of a specific satellite for cloud detection. Various threshold-based methods have been proposed, for example, the ISCCP cloud mask algorithm (Rossow and Garder 1993), the APOLLO (AVHRR Processing scheme Over cLouds, Land, and Ocean) method (Kriebel et al. 2003), and the Fmask method (Zhu, Wang, and Woodcock 2015). However, because the thresholds must be set for a specific satellite platform, the threshold-based methods suffer from poor generalizability.

The handcrafted feature-based methods detect clouds by applying a classifier to handcrafted features. Hu, Wang, and Shan (2015), for example, used a random forest algorithm with visual saliency features to produce predicted cloud masks, while Lin et al. (2015) used weighted principal component analysis (PCA) (Wold, Esbensen, and Geladi 1987) on pseudo-invariant features (PIFs) (Hadjimitsis, Clayton, and Retalis 2009) to generate cloud masks from multi-temporal images. Although these methods achieve good results in some scenarios, they rely heavily on manual features. Moreover, it is difficult to design handcrafted features that cope with the complexity of cloud morphology and the diversity of land cover. Hence, the application scenarios for these handcrafted feature-based methods are limited.

In recent years, deep learning techniques (LeCun, Bengio, and Hinton 2015) have been widely applied to cloud detection, which is usually cast as a multi-channel image segmentation problem. Most deep learning-based methods use convolutional neural networks (CNNs) (LeCun et al. 1998) to automatically extract features from images; classification operations are then performed to produce high-accuracy predicted cloud masks. For instance, Xie et al. (2017) first used CNNs to build a shallow model for cloud detection with relatively good results, while Shi et al. (2017) implemented CNNs as the feature extractor and a support vector machine (SVM) (Noble 2006) as the classifier. However, although the introduction of CNNs has led to relatively good results, these methods have difficulty coping with some complex scenarios because their structures are fairly simple. Li et al. (2019) designed a feature pyramid network that incorporates both low- and high-level features to detect clouds. Jeppesen et al. (2019) introduced an encoder-decoder-based architecture that integrates multi-level features for cloud detection. Yang et al. (2019) proposed an effective cloud detection structure that consists of several parallel dilated convolution and global average pooling layers to extract multiscale and global context information from thumbnails of remote sensing images. Luotamo, Metsämäki, and Klami (2020) developed network structures with dual CNNs configured in a network ensemble strategy (Pan, Shi, and Xu 2017) to generate cloud masks. He et al. (2021) implemented a network structure with deformable convolution, achieving quite good cloud detection results. Luo et al. (2022) proposed a novel lightweight architecture with very few parameters that detects clouds efficiently. What all these deep learning methods have in common is that, with complicated architectures, relevant features are extracted efficiently and effectively, and good results can be expected if enough labeled cloudy images are available for training. Additionally, unlike the threshold-based and handcrafted feature-based methods, the deep learning methods automatically extract features from labeled images and are thus highly generalizable.

1.2. Challenges

Despite generally good results with most of the current methods, there are still challenges to be overcome in cloud detection, which arise from the mechanism of cloud formation, the production of remote sensing cloudy data, the processing of the cloudy data, etc. This study mostly focuses on the challenges associated with how to comprehensively process remote sensing data for the purpose of cloud detection.

In practice, the number of labeled cloudy images is usually limited because labeling images with pixel-level cloud mask annotations is a time-consuming and expensive task. This situation is caused by the characteristics of clouds in remote sensing images. On the one hand, each complete remote sensing image is usually very large. On the other hand, the complexity of cloud morphology makes annotation difficult, especially around cloud boundaries. Conversely, unlabeled cloudy images (i.e. cloudy images without manually annotated cloud masks) are generally quite easy to obtain; for example, many satellites frequently release cloudy images without manually annotated cloud masks. For this reason, several suggestions have been made for how to work around the problem of insufficient labeled cloudy images. Guo et al. (2021), for instance, proposed an insightful unsupervised domain adaptation approach that can transfer a trained cloud detection model to another satellite platform without additional cloud masks. Guo et al. (2022) subsequently designed a semi-supervised cloud detection approach that improves cloud detection performance by confining the cloud regions from labeled and unlabeled data to one unified domain.

Although most existing deep learning-based cloud detection methods have achieved promising accuracy, their implementation is limited by an excessive reliance on labeled cloudy images (i.e. cloudy images with cloud masks) for training. In this scenario, the small number of labeled cloudy images is insufficient, while the large volume of unlabeled cloudy images remains unused for training cloud detection models. Hence, one big challenge is how to make comprehensive use of both the limited number of labeled cloudy images and the abundance of unlabeled images to effectively train cloud detection models.

Even when only a limited amount of labeled data is available, most existing cloud detection models tend to be based on general-purpose deep structures. Thus, the cloud features in the labeled cloudy images needed for training tend to be under-exploited. This raises another challenge, which is how to develop specific network structures that can comprehensively learn intrinsic cloud features from the limited labeled cloudy images.

1.3. Promising strategies

To address the challenges regarding cloud detection presented in Section 1.2, we propose the following two strategies.

To make use of both the insufficient labeled cloudy images and the abundant unlabeled cloudy images, we develop a new learning paradigm, that is, cross-supervised learning. The cross-supervised learning paradigm learns from labeled data using the traditional supervised training scheme and learns from unlabeled data using a mutually supervised training scheme motivated by the mutual guide technique (Tai et al. 2022). In contrast to most existing semi-supervised learning methods, which typically involve only one main network to augment the training data, cross-supervised learning involves two base networks. In addition to the individual supervised training on labeled data, the two base networks perform mutually supervised training on unlabeled data using the prediction results provided by each other. Thus, the two networks learn from both labeled and unlabeled data. Compared with the one-main-network scheme, the two networks are trained from two perspectives, in terms of both supervised learning and mutually supervised learning, rendering a comprehensive learning paradigm.

To overcome the limitation of insufficient labeled cloudy images, we develop a new network, called an Intensive Net, that leverages information from the features and labels of the labeled cloudy images. Built on an encoder-decoder structure, the Intensive Net comprises a focal attention guidance module (FAGM) and a regression block. The FAGM is proposed to leverage information from the features of the encoders. The complexity of the land cover and the cloud boundaries leads to some ambiguous regions in cloud detection results (He et al. 2021); these regions are usually at the cloud boundaries and in some complex land-cover areas (Wu et al. 2022). Thus, through an attention mechanism, the FAGM gives the decoders the ability to re-weight, and therefore enhance, the features generated by the encoders in these ambiguous regions for better cloud detection. The regression block is proposed to leverage information from the labels of the labeled cloudy images. Unlike the pixel-level cloud mask, the number of cloudy regions in an input image, that is, the cloud count, is high-level supervision information. Hence, an auxiliary branch is added to the network to regress the cloud count. Since the cloud mask of an input image already contains information about the cloud count, regressing the cloud count makes full use of the potential information contained in the original cloud mask. In this multi-task manner, this additional auxiliary information guides the network to learn better. The word "intensive" implies that the network focuses on the intensive use of information gleaned from the features and labels of the labeled images.

Based on the cross-supervised learning and the Intensive Net, we propose In-extensive Nets for cloud detection. The In-extensive Nets consist of two Intensive Nets and are trained using cross-supervised learning, which enables them to learn from both labeled and unlabeled cloudy images. The Intensive Net empowers the In-extensive Nets to leverage the information from the labeled cloudy images. Therefore, both the cross-supervised learning and the Intensive Net alleviate the reliance on labeled cloudy images for training. The word "in-extensive" implies that, compared with the Intensive Net, the In-extensive Nets can learn not only from the labeled cloudy images but also from the extensive unlabeled cloudy images through cross-supervised learning.

1.4. Our novel contributions

The contributions of our work are summarized as follows:

  1. We present a new learning paradigm, i.e., cross-supervised learning, that involves training with both labeled data and unlabeled data.

  2. We develop In-extensive Nets for cloud detection. Benefiting from the cross-supervised learning and the built-in Intensive Nets, the In-extensive Nets can make full use of both labeled and unlabeled cloudy images, substantially reducing the number of cloud labels (which require expensive manual annotation effort) needed for training.

  3. We release a cloud detection dataset, called the HY1C-UPC dataset, which consists of data from the coastal zone imager (CZI) of the Chinese HY-1C satellite. We also release our implementation code for public evaluations (see the Abstract).

2. Cross-supervised learning

In this section, we describe the training and testing procedures of the proposed cross-supervised learning paradigm, which involves both supervised training and mutually supervised training.

2.1. Base network

The base network is formulated as follows:

$$p = F(x;\theta), \tag{1}$$

where $F(\cdot;\theta)$ denotes a base network, $x = [x_1, x_2, \ldots, x_D]^T$ denotes an input data sample, $D$ denotes the dimension of the sample, $\theta$ denotes the model parameters, and $p$ denotes the prediction.

In supervised learning, there is a corresponding label $y = [y_1, y_2, \ldots, y_C]^T$ for the input data $x$, where $C$ is the total number of classes. The aim of the training process is to maintain consistency between $p$ and $y$. The loss function $L_S$ for the supervised learning is as follows:

$$L_S = L(p, y), \tag{2}$$

where $L(\cdot)$ denotes the loss function for training. However, limited by expensive labeling costs, labeled data are not always sufficient for training.

2.2. Cross-supervised learning between two base networks

To overcome the limitation of insufficient labeled data described in Section 2.1, we develop a cross-supervised learning paradigm. The overall process of the cross-supervised learning for K iterations is shown in Figure 1, and the more detailed process for the $k$-th iteration is shown in Figure 2. Taking both labeled and unlabeled data as input, the cross-supervised learning is performed by two base networks. Each base network performs individual supervised training on the labeled data and mutually supervised training on the unlabeled data using the classification results provided by the other.

Figure 1. The overall paradigm of the cross-supervised learning. Both the Networks A and B update their parameters by performing supervised training for labeled data and mutually supervised training for unlabeled data in each training iteration.


Figure 2. Cross-supervised learning at the $k$-th iteration. The high confidence selection operation generates the supervisor prediction according to (12). The mutually supervised selection operation generates the supervisee prediction according to (13).


Let $X = \{x_1, x_2, \ldots, x_N\}$ denote one data batch of $N$ samples, which consists of $M$ labeled samples of labeled data $X_L$ and $N - M$ samples of unlabeled data $X_U$. Thus, $X$ is represented as follows:

$$X = \{X_L, X_U\}. \tag{3}$$

The prediction $P$ for the data batch $X$ is generated as follows:

$$P = F(X;\theta). \tag{4}$$

The prediction $P$ consists of a prediction $P_L$ for the labeled data $X_L$ and a prediction $P_U$ for the unlabeled data $X_U$. Thus, the prediction can also be represented as follows:

$$P = \{P_L, P_U\}. \tag{5}$$

The cross-supervised learning involves two base networks (i.e. networks A and B) with the same architecture. Let $F(\cdot;\theta^A)$ denote network A with parameters $\theta^A$, and $F(\cdot;\theta^B)$ denote network B with parameters $\theta^B$. Predictions for the data $X$ by the two networks are generated as follows:

$$P^A = F(X;\theta^A), \tag{6}$$
$$P^B = F(X;\theta^B), \tag{7}$$

where $P^A$ denotes the prediction generated by network A for the data $X$ and $P^B$ denotes the prediction generated by network B for the data $X$. The prediction $P^A$ consists of the prediction $P_L^A$ generated by network A for the labeled data and the prediction $P_U^A$ generated by network A for the unlabeled data. Similarly, the prediction $P^B$ consists of the prediction $P_L^B$ generated by network B for the labeled data and the prediction $P_U^B$ generated by network B for the unlabeled data. Thus, $P^A$ and $P^B$ can also be represented as follows:

$$P^A = \{P_L^A, P_U^A\}, \tag{8}$$
$$P^B = \{P_L^B, P_U^B\}. \tag{9}$$

Both $P^A$ and $P^B$ have two parts. One part is the prediction for the labeled data $X_L$ (i.e. $P_L^A$ and $P_L^B$). The other part is the prediction for the unlabeled data $X_U$ (i.e. $P_U^A$ and $P_U^B$).

The input data $X_L$ has the label $Y_L$. Supervised training is conducted on the predictions for the labeled data. Based on (2), the losses are defined as follows:

$$L_S^A = L(P_L^A, Y_L), \tag{10}$$
$$L_S^B = L(P_L^B, Y_L), \tag{11}$$

where $L_S^A$ denotes the supervised loss for network A, and $L_S^B$ denotes the supervised loss for network B.

For the mutually supervised training, the two networks are mutually trained using the high-confidence classification results provided by each other on the unlabeled data.

Training network A requires guidance from network B. As shown in Figure 2, the high-confidence selection on $P_U^B$ selects the high-confidence classification results as follows:

$$P_{H_{UB}}^B = T_\gamma(P_U^B), \tag{12}$$

where $T_\gamma(\cdot)$ denotes an operation that selects the classification results whose confidence is higher than the threshold $\gamma$, $P_{H_{UB}}^B$ is referred to as the supervisor prediction generated by network B for network A, and $H_{UB}$ is the corresponding high-confidence index set. The supervisee prediction $P_{H_{UB}}^A$ is then obtained according to the mutually supervised selection as follows:

$$P_{H_{UB}}^A = S(P_U^A, H_{UB}), \tag{13}$$

where $S(\cdot)$ represents a function that selects the prediction $P_{H_{UB}}^A$ from $P_U^A$ according to the index set $H_{UB}$. Here, the word "supervisee" indicates that $P_{H_{UB}}^A$ is obtained by picking out the classification results that correspond to the "supervisor" prediction $P_{H_{UB}}^B$ according to the index set $H_{UB}$.

The loss $L_M^A$ for the mutually supervised training of network A is given as follows:

$$L_M^A = L(P_{H_{UB}}^A, P_{H_{UB}}^B). \tag{14}$$

Similarly, network A guides network B in the same manner, and the loss for the mutually supervised training of network B is given as follows:

$$L_M^B = L(P_{H_{UA}}^B, P_{H_{UA}}^A). \tag{15}$$

According to (14) and (15), each network provides its high-confidence classification results to guide the training of the other. In this way, the two networks learn from the unlabeled data in a mutually supervised manner.

The overall loss for the cross-supervised learning is a composite of the losses for the supervised training and for the mutually supervised training. Specifically, the overall losses $L^A$ and $L^B$ (for networks A and B) are calculated as follows:

$$L^A = L_S^A + L_M^A, \tag{16}$$
$$L^B = L_S^B + L_M^B. \tag{17}$$
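For concreteness, the sketch below implements one cross-supervised training iteration for a generic pair of classification networks in PyTorch. The function name, the cross-entropy loss, and the fixed confidence threshold gamma are illustrative assumptions rather than the exact configuration used later for cloud detection (which uses task-specific losses and a dynamic threshold); the released code linked in the Abstract is the authoritative implementation.

```python
# A minimal sketch of one cross-supervised iteration (Eqs. (10)-(17)),
# assuming PyTorch and a generic classifier pair with identical architecture.
import torch
import torch.nn.functional as F


def cross_supervised_step(net_a, net_b, opt_a, opt_b,
                          x_labeled, y_labeled, x_unlabeled, gamma=0.9):
    # Predictions of both networks on labeled and unlabeled data, Eqs. (6)-(9).
    logits_la, logits_ua = net_a(x_labeled), net_a(x_unlabeled)
    logits_lb, logits_ub = net_b(x_labeled), net_b(x_unlabeled)

    # Supervised losses L_S^A and L_S^B, Eqs. (10)-(11).
    loss_sa = F.cross_entropy(logits_la, y_labeled)
    loss_sb = F.cross_entropy(logits_lb, y_labeled)

    # High-confidence selection T_gamma, Eq. (12): each network keeps only the
    # predictions whose confidence exceeds the threshold gamma.
    conf_a, pseudo_a = logits_ua.softmax(dim=1).max(dim=1)
    conf_b, pseudo_b = logits_ub.softmax(dim=1).max(dim=1)
    mask_b = (conf_b > gamma).float()   # index set H_UB, supervises network A
    mask_a = (conf_a > gamma).float()   # index set H_UA, supervises network B

    # Mutually supervised losses L_M^A and L_M^B, Eqs. (13)-(15); pseudo-labels
    # are detached so that each network only guides, and is not updated by, the other.
    loss_ma = (F.cross_entropy(logits_ua, pseudo_b.detach(), reduction='none')
               * mask_b).sum() / mask_b.sum().clamp(min=1.0)
    loss_mb = (F.cross_entropy(logits_ub, pseudo_a.detach(), reduction='none')
               * mask_a).sum() / mask_a.sum().clamp(min=1.0)

    # Overall losses L^A and L^B, Eqs. (16)-(17).
    loss_a = loss_sa + loss_ma
    loss_b = loss_sb + loss_mb
    opt_a.zero_grad(); opt_b.zero_grad()
    loss_a.backward()
    loss_b.backward()
    opt_a.step(); opt_b.step()
    return loss_a.item(), loss_b.item()
```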

2.3. Cross-supervised learning from the training data

As shown in Figure 1, the training process of the cross-supervised learning is performed for up to $K$ iterations. Let $X = \{X^{(1)}, X^{(2)}, \ldots, X^{(K)}\}$ denote the whole training data of $K$ data batches. Let $\theta^{A(k)}$ and $\theta^{B(k)}$ $(k \in \{1, \ldots, K\})$ denote the parameters of networks A and B at the $k$-th iteration. The parameters $\theta^{A(1)}$ and $\theta^{B(1)}$ are randomly initialized via Kaiming initialization (He et al. 2015). $X^{(k)}$ is the input data batch at the $k$-th iteration. The cross-supervised learning of the two networks operates in terms of both supervised training and mutually supervised training, as described in Section 2.2. The cross-supervised learning is terminated after $K$ iterations of updates, when the two networks are comprehensively trained. This results in two networks with the parameters $\theta^{A(K+1)}$ and $\theta^{B(K+1)}$.

2.4. Final results via agreement and disambiguity between the two base networks

The final results are generated via agreement and disambiguity between the two base networks, as shown in Figure 3. Let $X_T$ denote the testing data. $F(X_T;\theta^{A(K+1)})$ and $F(X_T;\theta^{B(K+1)})$ provide the preliminary classification results for $X_T$. The final classification result for a data sample $x_T \in X_T$ is generated by considering both $F(x_T;\theta^{A(K+1)})$ and $F(x_T;\theta^{B(K+1)})$. Two main scenarios are considered:

Figure 3. Agreement and disambiguity between two base networks. The agreement and disambiguity operation generates the final results by considering both the preliminary predictions generated by Networks A and B.


(1) Agreement: If the preliminary classification results $F(x_T;\theta^{A(K+1)})$ and $F(x_T;\theta^{B(K+1)})$ are the same, the two networks agree on the label assignment for the sample $x_T$. In this situation, the final classification result for the sample is the agreed label.

(2) Disambiguity: If the preliminary classification results $F(x_T;\theta^{A(K+1)})$ and $F(x_T;\theta^{B(K+1)})$ are different, ambiguity over which label to assign to the sample $x_T$ arises. This ambiguity is eliminated by taking the label with the higher confidence as the classification result for the sample. A minimal sketch of this fusion rule is given after this list.
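The sketch below, assuming each trained network outputs per-class probabilities, illustrates this agreement-and-disambiguity rule; the function name and tensor shapes are hypothetical.

```python
# A minimal sketch of the agreement-and-disambiguity fusion at test time.
import torch


def fuse_predictions(prob_a, prob_b):
    """prob_a, prob_b: class probabilities of shape [N, C, ...]."""
    conf_a, label_a = prob_a.max(dim=1)   # confidence and label from network A
    conf_b, label_b = prob_b.max(dim=1)   # confidence and label from network B

    # Agreement: where the two networks assign the same label, keep it.
    fused = label_a.clone()

    # Disambiguity: where they disagree, keep the label with higher confidence.
    disagree = label_a != label_b
    winner = torch.where(conf_a >= conf_b, label_a, label_b)
    fused[disagree] = winner[disagree]
    return fused


# Example with two-class (cloud / background) probability maps.
prob_a = torch.rand(1, 2, 4, 4).softmax(dim=1)
prob_b = torch.rand(1, 2, 4, 4).softmax(dim=1)
print(fuse_predictions(prob_a, prob_b).shape)   # torch.Size([1, 4, 4])
```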

Algorithm 1: Cross-supervised Learning

3. In-extensive Nets

In this section, we develop In-extensive Nets based on cross-supervised learning for cloud detection. The architecture of the In-extensive Nets is shown in Figure 4. The In-extensive Nets consist of two Intensive Nets, which are coupled through cross-supervised learning. In this scenario, the In-extensive Nets use both labeled and unlabeled cloudy images to train efficient networks for cloud detection. Section 3.1 provides the detailed architecture of the Intensive Net. The overall structure of the In-extensive Nets is presented in Section 3.2. Section 3.3 introduces the training of the In-extensive Nets, and Section 3.4 covers the testing. The implementation details of the In-extensive Nets are given in Section 3.5.

Figure 4. The framework of the In-extensive Nets. The In-extensive Nets take the labeled and unlabeled cloudy images as inputs. The included Intensive Nets A and B perform supervised training for labeled cloudy images and mutually supervised training for unlabeled cloudy images.


3.1. Intensive Net

The Intensive Net has an enhanced encoder-decoder architecture, as shown in Figure 5. The encoders are used for feature extraction, and the decoders are used for cloud mask recovery. Between the encoders and decoders, a module called the focal attention guidance module (FAGM) generates focal enhanced features from the encoder features for the decoders. After the deepest layer of the encoders, a regression block is deployed to regress the number of cloudy regions in the image, that is, the cloud count.

Figure 5. The architecture of the Intensive Net. The focal attention guide module (FAGM) reweighs the features of ambiguous regions from the encoders. The regression block regresses the cloud count of the cloud masks.


To extract the multi-level features from an input cloudy image, a structure like the Efficient-b3 Net is implemented in the four encoders. The Efficient-b3 Net accurately extracts multi-level features with few parameters by balancing the width and number of the bottleneck blocks (Tan and Le 2019).

Four decoders are included in the Intensive Net. Each decoder is composed of one upsample layer and two convolution layers with batch normalization (BN) (Ioffe and Szegedy 2015). The decoders gradually recover cloud masks from the concatenated features.

The FAGM is used to connect the encoder and decoder. As shown in Figure 6, the FAGM takes the pending feature and the guidance feature as inputs and generates the focal enhanced feature. The pending feature is the feature that needs to be enhanced. The guidance feature is the feature used for guiding the pending feature. The focal enhanced feature is the result of enhancing the pending feature. Specifically, the guidance feature passes through a 1 × 1 convolution layer, an upsample layer, and a sigmoid layer to generate a coarse map. The coarse map is a map of the rough cloud detection results, with values ranging from 0 to 1. The areas with pixel values close to 0.5 are considered ambiguous areas, that is, areas that the guidance feature found difficult to classify. To highlight these ambiguous areas, we design the focal attention arithmetic as follows:

$$y = 1 - \frac{|x - 0.5|}{0.5}, \tag{18}$$

Figure 6. The architecture of the focal attention guide module (FAGM). The pending feature is enhanced by emphasizing the ambiguous regions identified by the guidance feature through the FAGM.


where $x$ denotes the input and $y$ denotes the output. The focal attention arithmetic changes the focal points of the coarse map by highlighting the ambiguous areas. The result is a focal attention map, a map that clearly illustrates the extent of the ambiguous areas in the coarse map. The pending feature and the focal attention map are then multiplied in an element-wise manner to derive the focal enhanced feature. Through element-wise multiplication, the ambiguous areas identified by the guidance feature are enhanced in the pending feature. The Intensive Net includes three FAGMs, as shown in Figure 5. Each FAGM takes the feature generated by the decoder as the guidance feature and the feature generated by the encoder as the pending feature. The feature generated by the decoder emphasizes the ambiguous areas of the feature generated by the encoder, and the two features are then concatenated to derive the concatenated feature as the input to the lower-level decoder. The FAGM gives the decoders the ability to re-weight the features generated by the encoders so as to enhance the features in ambiguous areas for better cloud mask recovery.
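The following sketch shows one possible PyTorch realization of the FAGM described above. The module name, channel sizes, and the ×2 upsampling factor are illustrative assumptions; only the coarse-map construction and the focal attention arithmetic of (18) follow the text.

```python
# A sketch of the focal attention guidance module (FAGM), assuming PyTorch.
import torch
import torch.nn as nn


class FAGM(nn.Module):
    def __init__(self, guidance_channels, scale_factor=2):
        super().__init__()
        # 1x1 convolution + upsample + sigmoid produce the coarse map.
        self.conv1x1 = nn.Conv2d(guidance_channels, 1, kernel_size=1)
        self.upsample = nn.Upsample(scale_factor=scale_factor,
                                    mode='bilinear', align_corners=False)

    def forward(self, pending, guidance):
        coarse = torch.sigmoid(self.upsample(self.conv1x1(guidance)))
        # Focal attention arithmetic, Eq. (18): values near 0.5 (ambiguous
        # regions) are mapped towards 1, confident values towards 0.
        focal_map = 1.0 - torch.abs(coarse - 0.5) / 0.5
        # Element-wise re-weighting of the pending (encoder) feature; in the
        # full network the result is concatenated with the decoder feature.
        return pending * focal_map


# Example usage with illustrative tensor shapes.
fagm = FAGM(guidance_channels=64, scale_factor=2)
pending = torch.randn(1, 32, 88, 88)    # encoder feature to be enhanced
guidance = torch.randn(1, 64, 44, 44)   # deeper decoder feature
print(fagm(pending, guidance).shape)    # torch.Size([1, 32, 88, 88])
```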

The regression block consists of two fully connected layers that regress the number of cloudy regions in the image, noting that regressing this number requires a global understanding of the whole image. The regression block, as shown in Figure 5, is deployed at the deepest layer of the Intensive Net. Like the target of cloud detection, the target of cloud count regression is also normalized to [0, 1] to balance the training of the two tasks.

The input of the Intensive Net is the normalized cloudy image $X \in \mathbb{R}^{W \times H \times C}$, where $W$ denotes the width, $H$ denotes the height, and $C$ denotes the number of channels. The outputs of the Intensive Net are the predicted cloud mask $P \in \mathbb{R}^{W \times H \times 1}$ and the predicted cloud count $N \in [0,1]$ (i.e. the number of cloudy regions). They are obtained as follows:

$$\{P, N\} = F(X;\theta), \tag{19}$$

where $F(\cdot;\theta)$ denotes the Intensive Net and $\theta$ denotes the network parameters.
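A compact sketch of the two-output forward pass of (19) is given below, with the regression block realized as global average pooling followed by two fully connected layers and a sigmoid so that the predicted count lies in [0, 1]. The hidden width, the stand-in encoder and decoder, and the channel sizes are illustrative assumptions; the real Intensive Net uses the Efficient-b3-style encoders, four decoders, and the FAGMs described above.

```python
# A sketch of the two-output forward pass of Eq. (19), assuming PyTorch.
import torch
import torch.nn as nn


class RegressionBlock(nn.Module):
    """Two fully connected layers regressing the normalized cloud count
    from the deepest encoder feature."""
    def __init__(self, in_channels, hidden=128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global context of the whole image
        self.fc = nn.Sequential(nn.Linear(in_channels, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1), nn.Sigmoid())  # N in [0, 1]

    def forward(self, deepest_feature):
        return self.fc(self.pool(deepest_feature).flatten(1)).squeeze(1)


class IntensiveNetSkeleton(nn.Module):
    """Illustrative skeleton only: the real encoders, decoders, and FAGMs are omitted."""
    def __init__(self, in_channels=4, deepest_channels=384):
        super().__init__()
        self.encoder = nn.Conv2d(in_channels, deepest_channels, 3, stride=32, padding=1)
        self.decoder = nn.Sequential(nn.Conv2d(deepest_channels, 1, 1),
                                     nn.Upsample(scale_factor=32, mode='bilinear'),
                                     nn.Sigmoid())
        self.regressor = RegressionBlock(deepest_channels)

    def forward(self, x):
        deepest = self.encoder(x)
        cloud_mask = self.decoder(deepest)      # P, one-channel cloud probability map
        cloud_count = self.regressor(deepest)   # N, normalized cloud count in [0, 1]
        return cloud_mask, cloud_count


net = IntensiveNetSkeleton()
mask, count = net(torch.randn(1, 4, 352, 352))
print(mask.shape, count.shape)   # torch.Size([1, 1, 352, 352]) torch.Size([1])
```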

3.2. Overview of in-extensive Nets

As mentioned, the In-extensive Nets consist of two Intensive Nets (i.e. Intensive Net A and Intensive Net B) with the same architecture. These networks are trained according to the cross-supervised learning procedure described in Section 2.2, then tested following the agreement and disambiguity process described in Section 2.4.

3.3. Training of in-extensive Nets

As shown in Figure 4, the Intensive Nets A and B take one cloudy image batch as their input and generate predicted cloud masks. The cloudy image batch consists of a labeled cloudy image $X_L$ and an unlabeled cloudy image $X_U$. The predicted cloud masks, along with the predicted cloud counts, are then calculated by the two networks following (19). The predicted cloud masks consist of the predicted cloud masks for the labeled cloudy image (i.e. $P_L^A$ from Intensive Net A and $P_L^B$ from Intensive Net B) and the predicted cloud masks for the unlabeled cloudy image (i.e. $P_U^A$ from Intensive Net A and $P_U^B$ from Intensive Net B). Similarly, the predicted cloud counts comprise the predicted number of cloudy regions in the labeled cloudy image (i.e. $N_L^A$ from Intensive Net A and $N_L^B$ from Intensive Net B) and the predicted number of cloudy regions in the unlabeled cloudy image (i.e. $N_U^A$ from Intensive Net A and $N_U^B$ from Intensive Net B). In the cross-supervised learning, the supervised training is performed on the predictions for the labeled cloudy image, and the mutually supervised training is performed on the predictions for the unlabeled cloudy image.

The supervised training for cloud detection involves calculating the supervised losses (i.e. $L_S^A$ for Intensive Net A and $L_S^B$ for Intensive Net B) between the predicted cloud masks for the labeled cloudy image and the ground truth cloud mask according to (10) and (11). Similarly, the supervised training for cloud count regression involves calculating the supervised losses between the predicted cloud counts for the labeled images and the ground truth cloud counts. However, most cloud detection datasets are published without cloud counts, and relabeling the datasets manually would not only be time-consuming but would also probably be inaccurate. Hence, we convert the problem of generating the cloud count into the problem of computing the number of connected regions. We use the flood-filling algorithm (Smith 1979) based on the depth-first search (DFS) algorithm (Tarjan 1972) to automatically calculate the cloud count from the ground truth cloud mask. The supervised losses for cloud count regression are formulated as follows:

$$L_{SN}^A = L(N_L^A, \mathrm{Count}(Y_L)), \tag{20}$$
$$L_{SN}^B = L(N_L^B, \mathrm{Count}(Y_L)), \tag{21}$$

where $L_{SN}^A$ denotes the supervised loss for cloud count regression of Intensive Net A, $L_{SN}^B$ denotes the supervised loss for cloud count regression of Intensive Net B, $Y_L$ denotes the ground truth cloud mask, and $\mathrm{Count}(\cdot)$ denotes the flood-filling algorithm used to compute the cloud count from the ground truth cloud mask $Y_L$.
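As a sketch of this counting step, the snippet below derives the normalized cloud-count target from a binary ground-truth mask. The DFS-based flood filling described in the text is replaced by scipy.ndimage.label, an equivalent connected-component counter, and the normalization constant max_count is an illustrative assumption, since the text does not specify how the count is scaled to [0, 1].

```python
# A sketch of deriving the cloud-count regression target from a cloud mask.
import numpy as np
from scipy import ndimage


def cloud_count_target(cloud_mask, max_count=50):
    """cloud_mask: binary array of shape [H, W] (1 = cloud).
    Returns the number of connected cloudy regions, normalized to [0, 1]."""
    _, num_regions = ndimage.label(cloud_mask)   # count connected cloud regions
    return min(num_regions / max_count, 1.0)


# Example: two separate cloudy regions -> count 2, normalized by max_count.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[1:3, 1:3] = 1
mask[5:7, 5:7] = 1
print(cloud_count_target(mask))   # 0.04
```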

As shown in Figure 4, the mutually supervised training for cloud detection is implemented by guiding each Intensive Net to calculate the mutually supervised loss (i.e. $L_M^A$ for Intensive Net A or $L_M^B$ for Intensive Net B) between its supervisee cloud mask and the supervisor cloud mask provided by the other for the unlabeled cloudy image, according to (14) and (15). Training Intensive Net A requires guidance from Intensive Net B and vice versa. As shown in Figure 2, high-confidence predictions are selected from $P_U^B$ to form the supervisor cloud mask in the high-confidence selection process following (12). One supervisor cloud mask includes cloud pixels, background pixels, and any unselected low-confidence pixels; only the cloud pixels and the background pixels are used to supervise the training of Intensive Net A. To use the supervisor cloud mask provided by Intensive Net B, the mutually supervised selection is performed on $P_U^A$. The mutually supervised selection builds the supervisee cloud mask by selecting the pixels from $P_U^A$ according to the positions of the cloud and background pixels of the supervisor cloud mask provided by Intensive Net B. The mutually supervised loss $L_M^A$ is calculated between the supervisee cloud mask generated by Intensive Net A and the supervisor cloud mask provided by Intensive Net B. The mutually supervised training of Intensive Net B proceeds similarly: the mutually supervised loss $L_M^B$ is calculated between the supervisee cloud mask generated by Intensive Net B and the supervisor cloud mask provided by Intensive Net A. Similar to the mutually supervised training for cloud detection, the mutually supervised training for cloud count regression is performed by guiding each Intensive Net to calculate the mutually supervised loss between its own predicted cloud count and the predicted cloud count provided by the other for the unlabeled cloudy image. The mutually supervised losses for cloud count regression are formulated as follows:

$$L_{MN}^A = L(N_U^A, N_U^B), \tag{22}$$
$$L_{MN}^B = L(N_U^B, N_U^A), \tag{23}$$

where $L_{MN}^A$ denotes the mutually supervised loss for cloud count regression of Intensive Net A, and $L_{MN}^B$ denotes the same loss for Intensive Net B.

Guided by cross-supervised learning, the overall losses of the two Intensive Nets are derived as follows:

$$L_{overall}^A = \lambda_D (L_S^A + L_M^A) + \lambda_R (L_{SN}^A + L_{MN}^A), \tag{24}$$
$$L_{overall}^B = \lambda_D (L_S^B + L_M^B) + \lambda_R (L_{SN}^B + L_{MN}^B), \tag{25}$$

where $L_{overall}^A$ denotes the overall loss for Intensive Net A, $L_{overall}^B$ denotes the overall loss for Intensive Net B, $\lambda_D$ denotes the weight for the cloud detection task, and $\lambda_R$ denotes the weight for the cloud count regression task. $\lambda_D$ and $\lambda_R$ are set empirically; in this study, we set both to 1. Binary cross entropy (BCE) loss is used for the cloud detection losses (i.e. $L_S^A$, $L_S^B$, $L_M^A$, and $L_M^B$), and mean squared error (MSE) loss is used for the cloud count regression losses (i.e. $L_{SN}^A$, $L_{SN}^B$, $L_{MN}^A$, and $L_{MN}^B$). Thus, benefiting from (24) and (25), the In-extensive Nets learn from both labeled and unlabeled cloudy images in a cross-supervised manner.
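The sketch below assembles the overall loss of (24) for one Intensive Net under these choices (BCE for the mask terms, MSE for the count terms, and λD = λR = 1). The argument names are hypothetical, and the supervisor mask and peer count coming from the other network are detached so that gradients do not flow between the two networks.

```python
# A sketch of the overall loss of Eq. (24) for one Intensive Net, assuming PyTorch.
import torch.nn.functional as F


def overall_loss(pred_mask_l, gt_mask,              # labeled image: prediction and ground truth
                 supervisee_mask, supervisor_mask,  # unlabeled image: own and peer high-confidence masks
                 pred_count_l, gt_count,            # labeled image: predicted and ground-truth counts
                 pred_count_u, peer_count_u,        # unlabeled image: own and peer predicted counts
                 lambda_d=1.0, lambda_r=1.0):
    l_s = F.binary_cross_entropy(pred_mask_l, gt_mask)                       # Eq. (10)/(11)
    l_m = F.binary_cross_entropy(supervisee_mask, supervisor_mask.detach())  # Eq. (14)/(15)
    l_sn = F.mse_loss(pred_count_l, gt_count)                                # Eq. (20)/(21)
    l_mn = F.mse_loss(pred_count_u, peer_count_u.detach())                   # Eq. (22)/(23)
    return lambda_d * (l_s + l_m) + lambda_r * (l_sn + l_mn)                 # Eq. (24)/(25)
```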

3.4. Testing of in-extensive Nets

Given one testing cloudy image $X_T$, the two trained Intensive Nets generate preliminary predicted cloud masks $P_T^A$ and $P_T^B$. The final cloud detection results of the In-extensive Nets are generated according to the agreement and disambiguity process in Section 2.4. As mentioned, if the cloud detection results for a pixel in $P_T^A$ and $P_T^B$ are the same, the final cloud detection result for this pixel is the agreed result; if they are different, the final cloud detection result for this pixel is the result with the higher confidence.

3.5. Implementation details of in-extensive Nets

Several implementation details apply to the training of the In-extensive Nets. The encoders adopted from the Efficient-b3 Net are pretrained on ImageNet (Krizhevsky et al. 2012) before training. ImageNet-pretrained models have proven effective and are widely used in the cloud detection field (Guo et al. 2020; Yan et al. 2018) as well as in other remote sensing applications (Li et al. 2017a). The adaptive moment estimation (Adam) optimizer (Kingma and Ba 2014) is used to optimize our networks during training. We set the hyperparameters to $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. We use a learning-rate decay strategy to train our networks as follows:

$$lr_e = lr_1 \times \left(1.0 - \frac{e-1}{E}\right)^{power}, \tag{26}$$

where $e \in \{1, 2, \ldots, E\}$ denotes the current training epoch, $E$ denotes the total number of training epochs, and $lr_e$ denotes the learning rate at epoch $e$. We set the initial learning rate $lr_1$ to $10^{-5}$, $E$ to 100, and $power$ to 0.5. In this scenario, the learning rate decreases gradually as the training epoch increases.
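A minimal sketch of this schedule is shown below, assuming the learning rate is updated once per epoch and that (26) is the standard polynomial decay; the function name is illustrative.

```python
# A sketch of the polynomial learning-rate decay of Eq. (26).
def poly_lr(epoch, lr_1=1e-5, total_epochs=100, power=0.5):
    """epoch is 1-based; returns the learning rate for that epoch."""
    return lr_1 * (1.0 - (epoch - 1) / total_epochs) ** power


for e in (1, 50, 100):
    print(e, poly_lr(e))   # the learning rate decreases gradually with the epoch
```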

We use a dynamic threshold strategy to determine the threshold in the high-confidence selection process in Section 3.3. Specifically, thresholds are computed for selecting the top $k_e$% of cloud and background pixels in a predicted cloud mask at training epoch $e$. $k_e$ increases as the training epoch increases via an exponential ramp-up function (Laine and Aila 2016), i.e. $k_e = 100 \times \exp(-5.0 \times (1 - e/E_r)^2)$, where $E_r$ denotes the total number of ramp-up epochs. We set $E_r$ to 30 during training.
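The sketch below illustrates the dynamic threshold strategy, assuming binary cloud-probability maps. Taking a single quantile over the combined confidences is an illustrative simplification of selecting the top k_e% of cloud and background pixels; Er = 30 follows the text.

```python
# A sketch of the dynamic-threshold strategy for high-confidence selection.
import math
import torch


def ramp_up_percentage(epoch, ramp_up_epochs=30):
    """k_e = 100 * exp(-5 * (1 - e / Er)^2), held at 100 after the ramp-up."""
    if epoch >= ramp_up_epochs:
        return 100.0
    return 100.0 * math.exp(-5.0 * (1.0 - epoch / ramp_up_epochs) ** 2)


def high_confidence_mask(prob_map, epoch, ramp_up_epochs=30):
    """prob_map: predicted cloud probabilities in [0, 1], shape [N, H, W].
    Selects roughly the top k_e% most confident cloud and background pixels."""
    k = ramp_up_percentage(epoch, ramp_up_epochs) / 100.0
    confidence = torch.maximum(prob_map, 1.0 - prob_map)   # distance from the ambiguous 0.5
    threshold = torch.quantile(confidence.flatten(), 1.0 - k)
    return confidence >= threshold


probs = torch.rand(2, 64, 64)
mask = high_confidence_mask(probs, epoch=10)
print(mask.float().mean())   # the fraction of selected pixels grows with the epoch
```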

4. Data for investigation

The datasets used for the experiments are GF1-WHU (Li et al. 2017b), SPARCS (Foga et al. 2017), and HY1C-UPC. The GF1-WHU dataset is a public cloud detection dataset for Gaofen-1 WFV images, the SPARCS dataset is a public cloud detection dataset consisting of Landsat 8 Operational Land Imager (OLI) images, and the HY1C-UPC dataset is assembled from images from the Chinese HY-1C satellite.

(1) GF1-WHU dataset. The GF1-WHU dataset consists of Gaofen-1 WFV images. The WFV images have a 16-m spatial resolution and four multispectral bands including the visible bands and the near-infrared spectral band. There are 108 Level-2A scenes collected from different global land-cover types with various cloud conditions. The dataset contains two categories, namely “cloud” and “cloud shadow”.

(2) SPARCS dataset. The SPARCS dataset consists of Landsat 8 OLI images. The Landsat 8 OLI images have 9 bands with mixed resolutions: band 8 has a resolution of 15 m, and the other bands have a resolution of 30 m. The images of the SPARCS dataset were randomly selected from scenes acquired in 2013 and 2014. Sub-images with a size of 1000 × 1000 cropped from entire Landsat images are used to represent the full scenes. The SPARCS dataset contains 80 Landsat-8 satellite images annotated in seven categories, namely "cloud", "cloud shadow", "shadow over water", "snow/ice", "water", "land", and "flooded".

(3) HY1C-UPC dataset. This is a new dataset we built from images from the Chinese HY-1C satellite. The coastal zone imager (CZI) on the HY-1C satellite has a 50-m spatial resolution with four multi-spectral bands; Table S1 shows details of the bands. Our HY1C-UPC dataset includes 25 scenes acquired from September 2021 to February 2022. The main scenes are collected from coastal zones, as shown in Figure 7. The observation swath of the CZI is wide; hence, the dataset includes various terrains (e.g. city, snow, forest, and ocean), as shown in Figure S1. The HY1C-UPC dataset contains 8 manually labeled scenes, labeled by experts using Photoshop, and 17 unlabeled scenes.

Figure 7. The investigated areas of the HY1C-UPC dataset. The main investigated areas are coastal zones around China.


In our study, we set up various experiments with different data divisions on the three datasets. The specific experimental data divisions are discussed along with each experiment. The original remote sensing images in the three datasets are quite large, which has the potential to cause memory problems during training. Therefore, we use the sliding window method to crop the large remote sensing images into small patches. The sliding window method crops small patches from a large image in sequence using a fixed-size window and is widely used in the field of remote sensing (Han et al. 2021). Our window size is 352 × 352. As shown in the first row of Figure S2, at each step the sliding window is moved by half of its size (i.e. 176 pixels) to crop the training images; in this way, the training images are expanded. As shown in the second row of Figure S2, at each step the sliding window is moved by its full size (i.e. 352 pixels) to crop the testing images; in this way, we can evaluate the results of the entire dataset by evaluating the results of the patches. For all experiments, the images of the testing set are entirely unseen during training.
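A minimal sketch of this cropping step is shown below, assuming NumPy arrays of shape [H, W, C]; border remainders are simply skipped here, which is a simplification rather than the paper's exact handling.

```python
# A sketch of the sliding-window cropping used to build training/testing patches.
import numpy as np


def sliding_window_crop(image, window=352, stride=176):
    """image: array of shape [H, W, C]. Returns a list of window x window patches.
    stride = window // 2 for training (overlapping), stride = window for testing."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, max(h - window, 0) + 1, stride):
        for left in range(0, max(w - window, 0) + 1, stride):
            # Patches that would overrun the image border are skipped in this sketch.
            patches.append(image[top:top + window, left:left + window])
    return patches


image = np.zeros((1000, 1000, 4), dtype=np.float32)
train_patches = sliding_window_crop(image, stride=176)   # overlapping patches
test_patches = sliding_window_crop(image, stride=352)    # non-overlapping patches
print(len(train_patches), len(test_patches))
```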

5. Experimental results

In this section, we present the detailed experimental results of our proposed methods. We first introduce the experimental setup and evaluation metrics. We then present an ablation study on the key parts of the Intensive Net, evaluating the effectiveness of the proposed Intensive Net and In-extensive Nets quantitatively and qualitatively. We conclude the section with discussions of the stability, thresholding strategy, efficiency, limitations, and promising future work of the proposed methods.

5.1. Experimental setup

Our Intensive Net and In-extensive Nets are implemented with PyTorch 1.8.1 (Paszke et al. 2019) on CentOS 7.9. A platform with two Intel Xeon Gold 5218R CPUs and one NVIDIA GeForce 2080 Ti GPU is used to deploy the experimental environment.

5.2. Evaluation metrics

We use the metrics of overall accuracy (OA), F1-score, and mean intersection over union (MIoU) to evaluate the accuracy of the proposed methods. Specifically, the F1-score is the harmonic mean of precision and recall, and measures the methods from a more comprehensive perspective. Let TP denote the number of true positive (TP) cloud detection results, TN the number of true negative (TN) results, FP the number of false positive (FP) results, and FN the number of false negative (FN) results. Each metric is calculated as follows:

$$OA = \frac{TP + TN}{TP + TN + FP + FN}, \tag{27}$$
$$F1 = \frac{2 \times \frac{TP}{TP+FN} \times \frac{TP}{TP+FP}}{\frac{TP}{TP+FN} + \frac{TP}{TP+FP}}, \tag{28}$$
$$MIoU = \frac{TP}{TP + FP + FN}. \tag{29}$$

To evaluate performance visually, we use a strategy that depicts the predicted cloud detection results in four colors. The black pixels represent TN results; white represents TP; magenta indicates FP; and green indicates FN. Both the magenta and green pixels are considered misclassifications.
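The snippet below sketches how the metrics of (27)-(29) and the four-color visualization can be computed from binary predictions and ground truth with NumPy; it assumes both classes are present so that the denominators are non-zero.

```python
# A sketch of the evaluation metrics and the four-colour error visualization.
import numpy as np


def cloud_metrics(pred, gt):
    """pred, gt: binary arrays of the same shape (1 = cloud)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    oa = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (27)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (28)
    miou = tp / (tp + fp + fn)                           # Eq. (29)
    return oa, f1, miou


def colorize(pred, gt):
    """Black = TN, white = TP, magenta = FP, green = FN."""
    vis = np.zeros(pred.shape + (3,), dtype=np.uint8)
    vis[(pred == 1) & (gt == 1)] = (255, 255, 255)   # TP: white
    vis[(pred == 1) & (gt == 0)] = (255, 0, 255)     # FP: magenta
    vis[(pred == 0) & (gt == 1)] = (0, 255, 0)       # FN: green
    return vis
```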

5.3. Ablation studies

We conduct an ablation study on the HY1C-UPC dataset to validate the effectiveness of the main parts of the Intensive Net. The 8 labeled scenes of the HY1C-UPC dataset are randomly divided into training and testing data at a ratio of 6:4. Our baseline is a pure encoder-decoder network without the proposed FAGM and regression block. The FAGM and the regression block are added to the baseline gradually to construct an Intensive Net (i.e. baseline + FAGM + regression block). As shown in Table 1, adding the FAGM to the baseline significantly improves the performance on all metrics. Performance is further improved by adding the regression block.

Table 1. Ablation study of In-extensive Nets on the HY1C-UPC dataset.

The visual comparison is shown in Figure 8. Here, the input image is an ocean scene covered by broken clouds. The baseline detects the main bodies of the clouds, but the results at the cloud boundaries are poor. Performance at the cloud boundaries improves steadily as the FAGM and the regression block are added. In particular, the cloud detection results at the cloud boundaries are significantly improved by adding the FAGM, which indicates that the FAGM can indeed enhance performance in ambiguous regions such as cloud boundaries. These results validate the effectiveness of the main parts of the Intensive Net.

Figure 8. Visual comparison of the baseline, baseline + FAGM, and Intensive Net (Baseline + FAGM + regression block) on the HY1C-UPC dataset for ablation study.


5.4. Comparison with state-of-the-art methods

We compare the proposed Intensive Net and In-extensive Nets with PSPNet (Zhao et al. 2017), DeepLab V3+ (Chen et al. 2017), RS-Net (Jeppesen et al. 2019), Cloud-Net (Mohajerani and Saeedi 2019), CDNet V2 (Guo et al. 2020), Mean Teacher (Tarvainen and Valpola 2017), and SSCD-Net (Guo et al. 2022) through accuracy assessment and visual comparison. RS-Net, Cloud-Net, CDNet V2, and SSCD-Net are cloud detection methods; PSPNet and DeepLab V3+ are general image segmentation methods. The metrics for accuracy assessment and the visualization strategy for visual comparison are both outlined in Section 5.2.

5.4.1. Intensive Net

To validate the effectiveness of the Intensive Net with sufficient labeled training data, we conduct experiments on the GF1-WHU, HY1C-UPC, and SPARCS datasets. The training data and testing data are divided randomly according to the ratio of 6:4 on the three datasets. The experimental data on the HY1C-UPC dataset follows the setting from Section 5.3.

The results of the accuracy assessment are shown in Table 2. The results show that our Intensive Net achieves the best F1-score, MIoU, and OA on all three datasets.

Table 2. Accuracy assessment of PSPNet, DeepLab V3+, RS-Net, Cloud-Net, CDNet V2, and Intensive Net on the GF1-WHU, HY1C-UPC, and SPARCS datasets.

The results of the visual comparison on the GF1-WHU dataset are shown in Figure S3. The input image is a city scene with clouds and rivers. PSPNet, RS-Net, and CDNet V2 misclassify the rivers as clouds. Cloud-Net, DeepLab V3+, and our Intensive Net yield relatively accurate results. Compared to Cloud-Net and DeepLab V3+, the Intensive Net has fewer misclassified pixels around the cloud boundaries and thus performs best.

The results of the accuracy evaluation and visual comparison show that the four cloud detection methods (i.e. RS-Net, Cloud-Net, CDNet V2, and Intensive Net) always perform better than the general methods (i.e. PSPNet and DeepLab V3+). As a milestone framework for semantic segmentation, DeepLab V3+ also performs well in cloud detection. Unlike competing cloud detection methods, which tend to design structures that use only cloud features, the Intensive Net makes full use of both the features and the label information of the labeled cloudy images through the FAGM and the regression block. Therefore, our method yields the best results among all methods.

5.4.2. In-extensive Nets

To validate the effectiveness of the In-extensive Nets, we conduct experiments on the GF1-WHU, SPARCS, and HY1C-UPC datasets.

The experiments on the HY1C-UPC and SPARCS datasets are conducted to evaluate the effectiveness of our In-extensive Nets when real unlabeled cloudy images are added for training. We add the 17 unlabeled scenes of the HY1C-UPC dataset and 7 unlabeled scenes of Landsat 8 OLI images (Table S2) to augment the training process described in Section 5.4.1. Thus, the training set comprises both labeled and unlabeled images for cross-supervised learning. The accuracy assessment results are shown in Table 3. The results show that our In-extensive Nets yield the best performance on the HY1C-UPC and SPARCS datasets. Moreover, the In-extensive Nets perform better on all metrics than the single Intensive Net, which validates the effectiveness of the cross-supervised learning. Compared with the two semi-supervised learning methods, that is, Mean Teacher and SSCD-Net, our In-extensive Nets also have obvious advantages.

Table 3. Accuracy assessment of PSPNet, DeepLab V3+, RS-Net, Cloud-Net, CDNet V2, Intensive Net, Mean Teacher, SSCD-Net, and In-extensive Nets on the HY1C-UPC and SPARCS datasets.

The results of the visual comparison are shown in Figure S4. The input image is a forest scene covered by both thick and thin clouds. Intensive Net, Mean Teacher, SSCD-Net, and In-extensive Nets capture the thin clouds. Compared to the competing methods, the In-extensive Nets perform better at the cloud boundaries. Benefiting from cross-supervised learning, the In-extensive Nets achieve the most visually accurate cloud detection results in this scenario.

The experiments on the GF1-WHU dataset evaluate the effectiveness of our In-extensive Nets under varying degrees of labeled data insufficiency. The basic training and testing sets follow the setting outlined in Section 5.4.1. We randomly select scenes from the basic training set at different rates (i.e. 50% (32 scenes), 25% (16 scenes), and 12.50% (8 scenes)). The images of the selected scenes are used for supervised training, and the remaining images are used for mutually supervised training without labels. The results of the accuracy assessment on the GF1-WHU dataset are shown in Table 4, which shows that our In-extensive Nets deliver the best performance at all of the labeled data rates. In particular, the In-extensive Nets consist of two Intensive Nets and perform mutually supervised training with unlabeled data, and this framework significantly outperforms a single Intensive Net.

Table 4. Accuracy assessment of PSPNet, DeepLab V3+, RS-Net, Cloud-Net, CDNet V2, Intensive Net, Mean Teacher, SSCD-Net, and In-extensive Nets at different labeled data rates on the GF1-WHU datasets.

Figure 9 shows a visual comparison on the GF1-WHU dataset at a labeled data rate of 12.5%. The first row of Figure 9 is a mountain scene covered by clouds. Due to the lack of sufficient labeled training data, all methods except SSCD-Net and the In-extensive Nets misclassify the ridges as clouds. Compared to SSCD-Net, the In-extensive Nets yield clearer cloud boundaries, and our In-extensive Nets obtain the most accurate cloud detection results in this scenario. The second row of Figure 9 is a snow scene covered by clouds. PSPNet, DeepLab V3+, RS-Net, and CDNet V2 cannot capture the clouds in this complex situation. The Intensive Net and Mean Teacher capture the clouds but also misclassify the snow as clouds. Compared with SSCD-Net, our In-extensive Nets show better performance at the cloud boundaries. The results of the accuracy evaluation and visual comparison show that our In-extensive Nets produce the most accurate results even under extreme label insufficiency. Without introducing unlabeled data, PSPNet, DeepLab V3+, RS-Net, Cloud-Net, and CDNet V2 show poor cloud detection performance when the training data are insufficient. The three methods that perform relatively well, namely the In-extensive Nets, SSCD-Net, and Mean Teacher, all account for insufficient labeled data, which illustrates that designing network models that consider the problem of insufficient training data is highly important. Unlike Mean Teacher and SSCD-Net, our In-extensive Nets leverage information from both the labeled and unlabeled cloudy images to handle the problem of labeled data insufficiency for cloud detection. Specifically, the cross-supervised learning paradigm in the In-extensive Nets empowers the two included Intensive Nets to mutually learn from the unlabeled cloudy images. Further, the two included Intensive Nets are designed to make full use of the feature and label information of the labeled cloudy images through the FAGM and the regression block.

Figure 9. Visual comparison of PSPNet, DeepLab V3+, RS-Net, Cloud-Net, CDNet V2, Intensive Net, Mean Teacher, SSCD-Net and In-extensive Nets on the GF1-WHU dataset at the labeled data of 12.50%.


5.5. Discussions

5.5.1. Stability

To further discuss the stability of our In-extensive Nets under different degrees of data insufficiency, we plot line graphs showing the cloud detection accuracy of the different methods at different labeled data rates according to Table 4. The graphs are shown in Figure 10.

Figure 10. Comparison of the detection accuracy of different methods as the labeled data rate is reduced on the GF1-WHU dataset. (a) F1-score, (b) MIoU, and (c) OA.


The results show that the performance of all methods declines to different extents as the number of labeled cloudy images is reduced. As the amount of labeled data is halved, RS-Net, Mean Teacher, SSCD-Net, and our In-extensive Nets decline more slowly than the other methods, and our In-extensive Nets decline the least of all. The stable performance of RS-Net comes from its relatively simple structure, which is less prone to overfitting when labeled data are insufficient; however, this simple structure also leads to generally poor detection performance. The stability of the other three methods, namely Mean Teacher, SSCD-Net, and our In-extensive Nets, stems from their ability to use unlabeled data for training. By leveraging information from both the labeled and unlabeled cloudy images through the cross-supervised learning paradigm and the included Intensive Nets, our In-extensive Nets show the best stability on all metrics under different degrees of data insufficiency.

5.5.2. Threshold for high confidence selection

High confidence selection is a key process in our In-extensive Nets. When training the In-extensive Nets, the dynamic threshold strategy outlined in Section 3.5 generates the threshold for high-confidence selection. To verify the effectiveness of the dynamic threshold strategy, we add experiments on In-extensive Nets trained with fixed thresholds of 0.5 and 0.7 on the HY1C-UPC dataset. The results, shown in Table 5, demonstrate that the dynamic threshold strategy performs better than the fixed thresholds of 0.5 and 0.7. At the beginning of the training process, both networks have poor cloud detection ability, and if too many predictions are used for cross-supervised training at this stage, noise may be introduced into the training process. With the dynamic threshold strategy, only a few predictions with extremely high confidence are used for cross-supervised training initially. As training proceeds, the performance of the two networks improves, and more and more predictions are used for cross-supervised training. Thus, the dynamic threshold strategy fits the training process of the In-extensive Nets well and helps the In-extensive Nets achieve better performance.

Table 5. Accuracy assessment of In-extensive Nets with different threshold strategies on the HY1C-UPC dataset.

5.5.3. The effectiveness of cross-supervised learning

To highlight the effectiveness of the cross-supervised learning in the In-extensive Nets, we evaluate the MIoU of Intensive Nets A and B, the final results of the In-extensive Nets, and a single Intensive Net per epoch, as shown in Figure 11. The results show that Intensive Nets A and B of the In-extensive Nets always perform better than a single Intensive Net, which indicates that the performance of Intensive Nets A and B is improved by cross-supervised learning. In particular, the final results obtained via agreement and disambiguity are always better than those of the included Intensive Nets A and B, which indicates the effectiveness of the agreement and disambiguity process.

Figure 11. The accuracy assessment of single Intensive Net, Intensive Net A, Intensive Net B, and final results via agreement and disambiguity every epoch.


We also visualize some intermediate results from the In-extensive Nets to explore the process of cross-supervised learning in one training iteration, as shown in Figure 12. Figure 12 shows one training iteration for one unlabeled cloudy image selected from the 20th training epoch. The input image is shown in the first column of Figure 12. Intensive Nets A and B produce different predicted cloud masks for the input unlabeled image due to their different learning situations, as shown in the second column of Figure 12. After high-confidence selection, both networks generate supervisor cloud masks by selecting some high-confidence predictions, as shown in the third column of Figure 12. Gray pixels represent unselected pixels, and white and black pixels are considered cloud pixels and background pixels, respectively. These white and black pixels are used to guide each other's training. The supervisee cloud masks generated by the two networks through mutually supervised selection are shown in the fourth column of Figure 12. These intermediate results show that the predictions of the two networks differ for some cloud areas, especially on some small-scale thin clouds. The existence of these differences enables the two networks to achieve complementarity through cross-supervised learning, and finally, the cloud detection performance of both networks is improved.

Figure 12. The intermediate results from In-extensive Nets in one training iteration from the 20th training epoch.

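For concreteness, the sketch below wires up one such mutually supervised iteration on an unlabeled batch, reusing the `select_high_confidence` helper from the earlier sketch. The two-class output layout, the equal loss weighting, and the network interfaces are assumptions for illustration and are not the exact implementation in our released code.

```python
import torch
import torch.nn.functional as F

def mutual_supervision_step(net_a, net_b, unlabeled_images,
                            epoch, total_epochs, ignore_index=255):
    """One cross-supervised iteration on an unlabeled batch (illustrative only).

    Each network selects its high-confidence predictions as a supervisor cloud
    mask; the other network is then trained against that mask, with the
    unselected (gray) pixels ignored by the loss.
    """
    logits_a = net_a(unlabeled_images)   # assumed output shape (B, 2, H, W)
    logits_b = net_b(unlabeled_images)

    with torch.no_grad():
        prob_a = torch.softmax(logits_a, dim=1)[:, 1]   # cloud probability of network A
        prob_b = torch.softmax(logits_b, dim=1)[:, 1]   # cloud probability of network B
        supervisor_a = select_high_confidence(prob_a, epoch, total_epochs, ignore_index)
        supervisor_b = select_high_confidence(prob_b, epoch, total_epochs, ignore_index)

    # A supervises B and B supervises A; ignored pixels contribute no gradient.
    loss_b = F.cross_entropy(logits_b, supervisor_a, ignore_index=ignore_index)
    loss_a = F.cross_entropy(logits_a, supervisor_b, ignore_index=ignore_index)
    return loss_a + loss_b
```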

5.5.4. Efficiency

Training efficiency is important for a cloud detection method. To evaluate the training efficiency of the proposed method, we compare the number of parameters and the computational cost of the In-extensive Nets with those of the other methods. The comparisons are shown in Table 6. The results show that our method has even fewer parameters than many single models. Owing to the simple and efficient structure of the Intensive Net, the In-extensive Nets still have a small number of parameters even though two Intensive Nets are included. The parameter count of the In-extensive Nets could be reduced further for application deployment; using techniques such as knowledge distillation to integrate the knowledge of the two networks into one network is one of our future efforts.

Table 6. Parameter and computation comparisons of PSPNet, SegNet, RS-Net, Cloud-Net, CDNet V2, Intensive Net, Mean Teacher, SSCD-Net, and In-extensive Nets.
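
As a side note, the trainable parameter count of any PyTorch model can be computed with a one-line helper such as the one below; the model names in the usage comment are placeholders, and whether the numbers in Table 6 were produced exactly this way is an assumption.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters of a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hypothetical usage: the In-extensive Nets contain two Intensive Nets, so their
# parameter count is roughly the sum of the two base networks.
# total = count_parameters(intensive_net_a) + count_parameters(intensive_net_b)
```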

5.5.5. Limitations

We examine the misclassified areas in the predictions of our In-extensive Nets on the HY1C-UPC dataset. Most of these areas involve regions where clouds and ice floes are mixed, as shown in Figure 13. Ice floes around the coastline have characteristics similar to clouds, so it is difficult to distinguish between the two. Improving cloud detection performance in scenes covered by ice floes will be a future research direction.

Figure 13. Some failure cases of the In-extensive Nets in the scenes where clouds and ice floes are mixed.


5.5.6. Promising future work

(1) In the agreement and disambiguity process of our method, the two networks contribute equally to the final result. However, the cloud detection performance of the two networks usually differs; one network always performs better than the other. Adaptively allocating the weights of the two networks in the agreement and disambiguity process is a promising direction.

(2) Cross-supervised learning comprises two base networks. A promising generalized form of cross-supervised learning is to extend the number of base networks to n. In this scenario, each base network performs supervised training on the labeled data and generalized mutually supervised training on the unlabeled data. The supervisor prediction might be the consistent prediction of the other n − 1 networks (see the sketch after this list). In this way, the knowledge that each network learns through mutually supervised training may be more accurate, and the networks may perform better.

(3) In this particular implementation of cross-supervised learning, the two base networks share the same architecture but have different initialization parameters. Because the learning abilities of the two networks are the same, the same hyperparameters can be used to train both of them, which greatly reduces the difficulty of training. However, if cross-supervised learning is performed on two networks with different architectures, the two networks can, in theory, complement each other better, but the hyperparameters are harder to set. Performing cross-supervised learning on heterogeneous models with a smart hyperparameter setting strategy would be a promising direction to explore in the future.
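
To make item (2) above concrete, the sketch below shows one possible way to build the supervisor mask for network i from the consistent predictions of the other n − 1 networks. This generalization is not implemented in this paper, so the function and its selection rule are purely illustrative assumptions.

```python
import torch

def consensus_supervisor(probs, i, ignore_index=255):
    """Supervisor mask for network i from the other n - 1 networks (illustrative).

    probs: list of n cloud-probability maps, each of shape (B, H, W).
    A pixel is kept only when all other n - 1 networks predict the same class;
    otherwise it is ignored in the mutually supervised loss.
    """
    others = [p >= 0.5 for j, p in enumerate(probs) if j != i]
    stacked = torch.stack(others, dim=0)        # (n - 1, B, H, W) boolean masks
    all_cloud = stacked.all(dim=0)              # every other network says "cloud"
    all_clear = (~stacked).all(dim=0)           # every other network says "background"
    supervisor = torch.full_like(stacked[0], ignore_index, dtype=torch.long)
    supervisor[all_cloud] = 1
    supervisor[all_clear] = 0
    return supervisor
```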

6. Conclusion

In this paper, a new learning paradigm, that is, cross-supervised learning, is proposed for empowering networks to learn from both labeled and unlabeled data. Cross-supervised learning is conducted on two base networks through supervised training on labeled data and mutually supervised training on unlabeled data. Furthermore, a cloud detection model based on cross-supervised learning, that is, the In-extensive Nets, is developed. Benefiting from cross-supervised learning and the built-in Intensive Nets, the In-extensive Nets can make full use of both labeled and unlabeled cloudy images, substantially reducing the number of cloud labels (which tend to require expensive manual effort) needed for training. The experimental results on the HY1C-UPC and GF1-WHU datasets validate that our In-extensive Nets perform well and have an obvious advantage when only a few labeled cloudy images are available for training.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Project No. 61971444, in part by the National Key R&D Program of China under Project No. 2019YFC1408400, and in part by the Innovative Research Team Program for Young Scholars at Universities in Shandong Province under Project No. 2020KJN010.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data Availability Statement

The GF1-WFV data that support the findings of this study are openly available at https://doi.org/10.1016/j.rse.2017.01.026, reference number li2017multi. For the HY1C CZI data, raw data were collected from https://osdds.nsoas.org.cn/. The training and testing data with cloud masks of this study are available along with our code at https://gitee.com/kang_wu/in-extensive-nets.

Additional information

Funding

This work was supported in part by the National Natural Science Foundation of China under Project No. 61971444, in part by the National Key R&D Program of China under Project No. 2019YFC1408400, and in part by the Innovative Research Team Program for Young Scholars at Universities in Shandong Province under Project No. 2020KJN010.

References

  • Chen, L.-C., G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. 2017. “Deeplab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected Crfs.” IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4): 834–21.
  • Foga, S., P. L. Scaramuzza, S. Guo, Z. Zhu, R. D. Dilley Jr, T. Beckmann, G. L. Schmidt, J. L. Dwyer, M. J. Hughes, and B. Laue. 2017. “Cloud Detection Algorithm Comparison and Validation for Operational Landsat Data Products.” Remote Sensing of Environment 194: 379–390. doi:10.1016/j.rse.2017.03.026.
  • Guo, J., Q. Xu, Y. Zeng, Z. Liu, and X. Zhu. 2022. “Semi-Supervised Cloud Detection in Satellite Images by considering the Domain Shift Problem.” Remote Sensing 14 (11): 2641. doi:10.3390/rs14112641.
  • Guo, J., J. Yang, H. Yue, X. Liu, and K. Li. 2021. “Unsupervised Domain-Invariant Feature Learning for Cloud Detection of Remote Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–15.
  • Guo, J., J. Yang, H. Yue, H. Tan, C. Hou, and K. Li. 2020. “CDnetV2: CNN-based Cloud Detection for Remote Sensing Imagery with cloud-snow Coexistence.” IEEE Transactions on Geoscience and Remote Sensing 59 (1): 700–713. doi:10.1109/TGRS.2020.2991398.
  • Hadjimitsis, D. G., C. R. Clayton, and A. Retails. 2009. “The Use of Selected Pseudo-invariant Targets for the Application of Atmospheric Correction in multi-temporal Studies Using Satellite Remotely Sensed Imagery.” International Journal of Applied Earth Observation and Geoinformation 11 (3): 192–200. doi:10.1016/j.jag.2009.01.005.
  • Han, L., P. Li, A. Plaza, and P. Ren. 2021. “Hashing for Localization (Hfl): A Baseline for Fast Localizing Objects in A Large-Scale Scene.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–16.
  • He, Q., X. Sun, Z. Yan, and K. Fu. 2021. “DABNet: Deformable Contextual and boundary-weighted Network for Cloud Detection in Remote Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–16.
  • He, K., X. Zhang, S. Ren, and J. Sun. 2015. “Delving Deep into Rectifiers: Surpassing human-level Performance on Imagenet Classification.” 2015 IEEE international conference on computer vision (ICCV), Santiago, Chile, 1026–1034. doi:10.1109/ICCV.2015.123.
  • Hu, X., Y. Wang, and J. Shan. 2015. “Automatic Recognition of Cloud Images by Using Visual Saliency Features.” IEEE Geoscience and Remote Sensing Letters 12 (8): 1760–1764. doi:10.1109/LGRS.2015.2424531.
  • Ioffe, S., and C. Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” International Conference on Machine Learning 448–456. PMLR
  • Jeppesen, J. H., R. H. Jacobsen, F. Inceoglu, and T. S. Toftegaard. 2019. “A Cloud Detection Algorithm for Satellite Imagery Based on Deep Learning.” Remote Sensing of Environment 229: 247–259. doi:10.1016/j.rse.2019.03.039.
  • Kingma, D. P., and J. Ba. 2014. “Adam: A Method for Stochastic Optimization.” arXiv preprint arXiv:1412.6980.
  • Kriebel, K.-T., G. Gesell, M. Kästner, and H. Mannstein. 2003. “The Cloud Analysis Tool APOLLO: Improvements and Validations.” International Journal of Remote Sensing 24 (12): 2389–2408. doi:10.1080/01431160210163065.
  • Krizhevsky, A., I. Sutskever, and G. E. Hinton. 2012. “Imagenet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems 25.
  • Laine, S., and T. Aila. 2016. “Temporal Ensembling for semi-supervised Learning.” arXiv preprint arXiv:1610.02242.
  • LeCun, Y., Y. Bengio, and G. Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436–444. doi:10.1038/nature14539.
  • LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner. 1998. “Gradient-based Learning Applied to Document Recognition.” Proceedings of the IEEE 86 (11): 2278–2324. doi:10.1109/5.726791.
  • Lin, C.-H., B.-Y. Lin, K.-Y. Lee, and Y.-C. Chen. 2015. “Radiometric Normalization and Cloud Detection of Optical Satellite Images Using invariant Pixels.” ISPRS Journal of Photogrammetry and Remote Sensing 106: 107–117. doi:10.1016/j.isprsjprs.2015.05.003.
  • Li, Z., H. Shen, Q. Cheng, Y. Liu, S. You, and Z. He. 2019. “Deep Learning Based Cloud Detection for Medium and High Resolution Remote Sensing Images of Different Sensors.” ISPRS Journal of Photogrammetry and Remote Sensing 150: 197–212. doi:10.1016/j.isprsjprs.2019.02.017.
  • Li, Z., H. Shen, H. Li, G. Xia, P. Gamba, and L. Zhang. 2017b. “Multi-feature Combined Cloud and Cloud Shadow Detection in GaoFen-1 Wide Field of View Imagery.” Remote Sensing of Environment 191: 342–358. doi:10.1016/j.rse.2017.01.026.
  • Li, Z., H. Shen, Q. Weng, Y. Zhang, P. Dou, and L. Zhang. 2022. “Cloud and Cloud Shadow Detection for Optical Satellite Imagery: Features, Algorithms, Validation, and Prospects.” ISPRS Journal of Photogrammetry and Remote Sensing 188: 89–108. doi:10.1016/j.isprsjprs.2022.03.020.
  • Li, Y., Y. Zhang, X. Huang, H. Zhu, and J. Ma. 2017a. “Large-scale Remote Sensing Image Retrieval by Deep Hashing Neural Networks.” IEEE Transactions on Geoscience and Remote Sensing 56 (2): 950–965. doi:10.1109/TGRS.2017.2756911.
  • Löw, F., D. Dimov, S. Kenjabaev, S. Zaitov, G. Stulina, and V. Dukhovny. 2022. “Land Cover Change Detection in the Aralkum with multi-source Satellite Datasets.” GIScience & Remote Sensing 59 (1): 17–35. doi:10.1080/15481603.2021.2009232.
  • Luo, C., S. Feng, X. Yang, Y. Ye, X. Li, B. Zhang, Z. Chen, and Y. Quan. 2022. “LWCDnet: A Lightweight Network for Efficient Cloud Detection in Remote Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–16. doi:10.1109/TGRS.2022.3173661.
  • Luotamo, M., S. Metsämäki, and A. Klami. 2020. “Multiscale Cloud Detection in Remote Sensing Images Using a Dual Convolutional Neural Network.” IEEE Transactions on Geoscience and Remote Sensing 59 (6): 4972–4983. doi:10.1109/TGRS.2020.3015272.
  • Mohajerani, S., and P. Saeedi. 2019. “Cloud-Net: An end-to-end Cloud Detection Algorithm for Landsat 8 Imagery.” 2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 1029–1032. doi:10.1109/IGARSS.2019.8898776.
  • Niederheiser, R., M. Winkler, V. Di Cecco, B. Erschbamer, R. Fernández, C. Geitner, H. Hofbauer, et al. 2021. “Using Automated Vegetation Cover Estimation from Close-Range Photogrammetric Point Clouds to Compare Vegetation Location Properties in Mountain Terrain.” GIScience & Remote Sensing 58 (1): 120–137. doi:10.1080/15481603.2020.1859264.
  • Noble, W. S. 2006. “What Is a Support Vector Machine?” Nature Biotechnology 24 (12): 1565–1567. doi:10.1038/nbt1206-1565.
  • Pan, B., Z. Shi, and X. Xu. 2017. “Hierarchical Guidance Filtering-Based Ensemble Classification for Hyperspectral Images.” IEEE Transactions on Geoscience and Remote Sensing 55 (7): 4177–4189. doi:10.1109/TGRS.2017.2689805.
  • Paszke, A., S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, et al. 2019. “Pytorch: An Imperative Style, high-performance Deep Learning Library.” Advances in Neural Information Processing Systems 32.
  • Rossow, W. B., and L. C. Garder. 1993. “Cloud Detection Using Satellite Measurements of Infrared and Visible Radiances for ISCCP.” Journal of Climate 6 (12): 2341–2369. doi:10.1175/1520-0442(1993)006<2341:CDUSMO>2.0.CO;2.
  • Shi, C., C. Wang, Y. Wang, and B. Xiao. 2017. “Deep Convolutional activations-based Features for ground-based Cloud Classification.” IEEE Geoscience and Remote Sensing Letters 14 (6): 816–820. doi:10.1109/LGRS.2017.2681658.
  • Smith, A. R. 1979. “Tint Fill.” The 6th Annual Conference on Computer Graphics and Interactive Techniques, Chicago, USA, 276–283. doi:10.1145/965103.807456.
  • Tai, X., M. Li, M. Xiang, and P. Ren. 2022. “A Mutual Guide Framework for Training Hyperspectral Image Classifiers with Small Data.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–17. doi:10.1109/TGRS.2021.3092351.
  • Tan, M., and Q. Le. 2019. “Efficientnet: Rethinking Model Scaling for Convolutional Neural Networks.” 2019 International Conference on Machine Learning, Long Beach, CA, USA, 6105–6114.
  • Tarjan, R. 1972. “Depth-first Search and Linear Graph Algorithms.” SIAM Journal on Computing 1 (2): 146–160. doi:10.1137/0201010.
  • Tarvainen, A., and H. Valpola. 2017. “Mean Teachers are Better Role Models: Weight-Averaged Consistency Targets Improve semi-supervised Deep Learning Results.” Advances in Neural Information Processing Systems 30.
  • Wold, S., K. Esbensen, and P. Geladi. 1987. “Principal Component Analysis.” Chemometrics and Intelligent Laboratory Systems 2 (1–3): 37–52. doi:10.1016/0169-7439(87)80084-9.
  • Wu, K., Z. Xu, X. Lyu, and P. Ren. 2022. “Cloud Detection with Boundary Nets.” ISPRS Journal of Photogrammetry and Remote Sensing 186: 218–231. doi:10.1016/j.isprsjprs.2022.02.010.
  • Xie, F., M. Shi, Z. Shi, J. Yin, and D. Zhao. 2017. “Multilevel Cloud Detection in Remote Sensing Images Based on Deep Learning.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10 (8): 3631–3640. doi:10.1109/JSTARS.2017.2686488.
  • Yang, J., J. Guo, H. Yue, Z. Liu, H. Hu, and K. Li. 2019. “CDnet: CNN-based Cloud Detection for Remote Sensing Imagery.” IEEE Transactions on Geoscience and Remote Sensing 57 (8): 6195–6211. doi:10.1109/TGRS.2019.2904868.
  • Yan, Z., M. Yan, H. Sun, K. Fu, J. Hong, J. Sun, Y. Zhang, and X. Sun. 2018. “Cloud and Cloud Shadow Detection Using Multilevel Feature Fused Segmentation Network.” IEEE Geoscience and Remote Sensing Letters 15 (10): 1600–1604. doi:10.1109/LGRS.2018.2846802.
  • Yu, X., X. Wu, C. Luo, and P. Ren. 2017. “Deep Learning in Remote Sensing Scene Classification: A Data Augmentation Enhanced Convolutional Neural Network Framework.” GIScience & Remote Sensing 54 (5): 741–758. doi:10.1080/15481603.2017.1323377.
  • Zhang, Y., W. B. Rossow, A. A. Lacis, V. Oinas, and M. I. Mishchenko. 2004. “Calculation of Radiative Fluxes from the Surface to Top of Atmosphere Based on ISCCP and Other Global Data Sets: Refinements of the Radiative Transfer Model and the Input Data.” Journal of Geophysical Research: Atmospheres 109 (D19). doi:10.1029/2003JD004457.
  • Zhao, H., J. Shi, X. Qi, X. Wang, and J. Jia. 2017. “Pyramid Scene Parsing Network.” 2017 IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, USA, 2881–2890. doi:10.1109/CVPR.2017.660.
  • Zhu, Z., S. Wang, and C. E. Woodcock. 2015. “Improvement and Expansion of the Fmask Algorithm: Cloud, Cloud Shadow, and Snow Detection for Landsats 4–7, 8, and Sentinel 2 Images.” Remote Sensing of Environment 159: 269–277. doi:10.1016/j.rse.2014.12.014.
  • Zhu, Z., and C. E. Woodcock. 2012. “Object-based Cloud and Cloud Shadow Detection in Landsat Imagery.” Remote Sensing of Environment 118: 83–94. doi:10.1016/j.rse.2011.10.028.