Review Article

A review of remote sensing image segmentation by deep learning methods

Article: 2328827 | Received 23 Nov 2023, Accepted 05 Mar 2024, Published online: 18 Mar 2024

ABSTRACT

Remote sensing (RS) images enable high-resolution information collection from complex ground objects and are increasingly utilized in Earth observation research. Recently, RS technologies have been continuously enhanced by various characterized platforms and sensors. Simultaneously, artificial intelligence vision algorithms are developing vigorously and playing a significant role in RS image analysis. In particular, by dividing images into different ground elements with specific semantic labels, RS image segmentation enables visual information acquisition and interpretation. With their advantages in deep feature extraction, deep learning (DL) algorithms have been exploited in recent years and proved to be highly beneficial for precise segmentation. In this paper, a comprehensive review is performed on remote sensing survey systems and various kinds of specially designed deep learning architectures. Meanwhile, DL-based segmentation methods applied in four domains are also illustrated, covering geography, precision agriculture, hydrology, and environmental protection. Finally, the existing challenges and promising research directions in RS image segmentation are discussed. It is envisioned that this review can provide a comprehensive technical reference for the deployment and successful exploitation of DL-empowered RS image segmentation approaches.

1. Introduction

As a practical and comprehensive technology for non-contact detection, remote sensing (RS) is useful for obtaining specific target information from a distance thanks to its wide coverage, high spatial/spectral resolution, and variable data structures (N. Zhang et al. Citation2020). In particular, from RGB and near infrared (NIR) to thermal infrared (TIR) and microwave bands, the accessible wavelengths depend on the remote sensing sensors (see Figure 1), including digital cameras, multispectral cameras, hyperspectral instruments, infrared sensors, and lidars. Such commonly used sensors are employed on aircraft, unmanned aerial vehicles (UAVs), satellites, and other platforms (T. Zhang et al. Citation2024). Considering the diverse properties of remote sensing sensors and the platforms they are mounted on, different combinations are utilized to obtain helpful remote sensing data (e.g. RS images) for task-specific applications. The acquisition procedure and applications are illustrated in Figure 2. In recent years, significant advances in RS technology have improved the spatial/spectral/temporal resolution of remote sensing images, providing robust datasets for RS downstream tasks. Therefore, RS images have been widely used in different fields and domains and are indispensable to Earth observation surveys, land use monitoring, and environmental protection (Yasir et al. Citation2023).

Figure 1. The spectral wavelengths/bands covered by different sensors.

Figure 2. The procedure of remote sensing image segmentation.

Given that RS image coverage has gradually expanded to cover various types of land objects, it is increasingly difficult to extract information from RS images. Consequently, as a significant method for information interpretation, image segmentation can process RS images under different paradigms and analyze the features extracted from the collected images. It is noteworthy that segmenting remote sensing images differs substantially from segmenting natural images, given their characteristics of multiple channels, abundant details, and few labeled samples (W. Liu et al. Citation2020). Traditional image segmentation methods (e.g. pixel-based and edge-detection-based methods) used for early image interpretation mainly rely on shallow semantic information, such as image texture and color gradient, restricting their capability to obtain high-level semantic information (J. Yuan, Wang, and Li Citation2014). In addition, for target extraction tasks in complex environments, traditional methods cannot meet the needs of RS image understanding (N. Zhang et al. Citation2020).

As a result, scholars have shown increasing interest in the potential of deep neural networks applied to RS image segmentation since 2014 (Y. Chen et al. Citation2014). Unlike traditional image segmentation models, learning-based approaches approximate complicated nonlinear relations via multi-layer networks, uncovering hidden information in the abundant spectral/spatial data. Besides, DL has shown great superiority in multi-scale feature extraction and combination, outperforming traditional models with substantial improvements in RS image processing and analysis. Under this circumstance, new architecture designs and other advanced DL methods are continuously proposed to enhance the segmentation performance and accuracy on RS images. For example, RS image segmentation is conventionally performed by classical Convolutional Neural Networks (CNNs) and Transformer networks (e.g. Swin Transformer, Z. Liu et al. Citation2021) for RS downstream tasks (Kampffmeyer, Salberg, and Jenssen Citation2016; Nogueira et al. Citation2019; L. Wang, Li, Duan et al. Citation2022; Z. Xu et al. Citation2021). Some novel networks are also designed for RS image segmentation, including LANet (Ding, Tang, and Bruzzone Citation2020), HRCNet (Z. Xu et al. Citation2020), and MANet (R. Li et al. Citation2021). Meanwhile, self-supervised learning (W. Li, Chen, and Shi Citation2021; H. Li et al. Citation2022), contrastive learning (Bai et al. Citation2022; J.-X. Wang et al. Citation2022), and transfer learning (Cui, Chen, and Lu Citation2020; Panboonyuen et al. Citation2019; Wurm et al. Citation2019) have also been introduced to deal with RS segmentation problems and have achieved promising performance.

Thanks to its strong capability of feature representation, DL has become a superior approach widely applied in different sub-fields with complex scenarios, including geological, agricultural, hydrological, and environmental applications. For example, Kussul compared CNNs with multilayer perceptron (MLP) and random forest for the land cover classification problem (Kussul et al. Citation2017). Zhang proposed a new deep convolutional neural network (DCNN)-based approach for automated crop disease detection using hyperspectral images captured by UAVs (X. Zhang et al. Citation2019). Yuan designed a novel multichannel network named MC-WBDN for robust water body detection (K. Yuan et al. Citation2021). Martins applied CNNs to tree canopy segmentation in urban environments using aerial RGB imagery (Martins et al. Citation2021). As a consequence, due to the different sensing platforms and the unique requirements of each task, DL algorithms are often specifically designed for a particular research purpose. This means a considerable number of networks and image processing methods have been designed for RS images with different resolutions and task requirements.

Therefore, to obtain a comprehensive understanding of the algorithms and applications of remote sensing image segmentation using DL methods, a systematic review is required to provide beneficial information for further research. Compared with existing reviews on RS image segmentation, our review contains comprehensive introductions of extensive research, multiple types of application fields, and overall coverage of both classical models and advanced techniques.

  • Our review provides complete introductions of various works on typical research and applications with clear categorizations for each section. Among previous reviews, Li introduced three DL image classification models, namely convolutional neural networks, stacked autoencoders, and deep belief networks, with their performance on publicly available datasets and imagery, but did not cover further applications on downstream tasks (Y. Li et al. Citation2018). Huang provided a comprehensive review of RS image segmentation with discussions of advanced methods and main challenges, but also paid little attention to the corresponding applications (L. Huang et al. Citation2023). In contrast, our review devises a multi-perspective categorization of DL models for segmenting RS images along three dimensions (namely the various architectures of RS image segmentation methods, label-efficient approaches, and popular domain applications), so that readers can quickly grasp the overall development of RS image segmentation.

  • Our review covers multiple types of application fields concerning image segmentation approaches. Among previous reviews, Neupane reviewed and performed a meta-analysis of deep learning-based semantic segmentation of urban scenes in satellite images (Neupane, Horanont, and Aryal Citation2021). Zang introduced algorithm developments, approach summaries, and quantitative experiments focusing on the land-use mapping task (Zang et al. Citation2021). In contrast, we introduce and itemize algorithm implementations across four popular downstream domains in detail, including geological applications, precision agriculture, hydrological applications, and environmental protection, making the review more comprehensive and inspiring for stakeholders of specific domains.

  • Our review analyzes classical models and advanced methods (e.g. multi-modal models, diffusion models, and foundation models) related to algorithm development in the field of natural image segmentation. Among previous reviews, Yuan reviewed the development of deep learning methods for RS image segmentation, mainly focusing on methods for non-conventional data, the demands for accuracy improvement, and the lack of samples (X. Yuan, Shi, and Gu Citation2021). Lv summarized the advantages and disadvantages of segmentation models on RS images with detailed introductions of each structure, but paid little attention to label-efficient methods and other advanced methods (Lv et al. Citation2023). In contrast, we provide an overall review covering recently emerging and developing techniques, summarizing innovations and challenges for future directions and further development.

In summary, compared to previous articles, our review is more comprehensive, with summarized challenges and corresponding approaches plus categorized classifications of applications. Since adopting DL methods in remote sensing image segmentation raises several challenges that did not appear before, we also analyze these challenges and introduce the methods proposed in response to them. With regard to the advantages of different DL methods, we conduct a detailed review in the following parts. The main contributions of this review are summarized as follows:

  • The survey covers 160 DL algorithms and related methods for a comprehensive review, providing an overview of deep learning-based methods for image segmentation in the remote sensing field, including analyses of different models and the applicability of DL models in response to existing challenges.

  • We provide an introduction of DL approaches applied to remote sensing image segmentation for popular and promising downstream tasks, which are categorized into four types in our review (namely geological applications, precision agriculture, hydrological applications, and environmental protection issues).

  • Some directions using novel DL strategies that can be applied to remote sensing image analysis in the future are introduced.

2. Remote sensing observation platforms and sensors

The quality of remote sensing images is determined by the spatial, spectral, and temporal resolution of the data measuring system. A high-quality image generally requires high resolution in all three respects. Poor spatial resolution yields large pixels that characterize land cover poorly. Spectral resolution refers to the number and position of spectral bands in the electromagnetic spectrum over which reflected radiance is collected. Temporal resolution defines the revisit frequency for a given location. Regarding RS images for specific tasks, two remote sensing systems are popularly used, namely aviation remote sensing and space remote sensing. They are important parts of the Earth observation system, complementing each other in various applications.

Aviation platforms mainly include manned aircraft, helicopters, unmanned aerial vehicles (UAVs), balloons, etc. Since these low-altitude platforms collect data at flying heights from 100 to 18,000 m (Everaerts Citation2008), they can make up for the spatial and real-time monitoring limitations of traditional facilities (e.g. tripods, vehicles, and fixed rail systems) (N. Zhang et al. Citation2020). Therefore, with high efficiency and flexible operations, aviation remote sensing has become an important platform for obtaining high-resolution sensing images. Among these platforms, UAV-based ones are low cost and flexible, providing high spatial resolution data due to their low flight altitude. Unlike manned aircraft with high operating and maintenance costs, UAVs can fill the resolution gap between manned aircraft (0.2–2 m) and traditional ground-based platforms (<1 cm) with a typical spatial resolution of 1–20 cm (Wójtowicz, Wójtowicz, and Piekarczyk Citation2016). In addition, UAVs are popular for their small volume, light weight, and simple operation, and have thus been widely applied in agriculture and forestry monitoring recently.

As depicted in Figure 3, a UAV sensing system is mainly composed of the UAV platform, sensing units, and a ground control system. According to how lift is provided, UAV platforms can be divided into three categories, namely rotor-wing, fixed-wing, and hybrid UAVs. Rotor-wing UAVs rely on spinning rotors for aerodynamic lift, while fixed-wing UAVs use rigid wings like an aeroplane. Hybrid UAVs combine fixed-wing and rotor-wing designs to achieve a long flight duration. Various sensors have respective advantages in the survey process. The optical sensing sensors with different spectral wavelengths are listed in Table 1, including RGB cameras, color infrared cameras, multispectral cameras, hyperspectral cameras, LiDAR, and thermal cameras. RGB cameras attain red, green, and blue band information with the advantages of low cost and simple operation, but have limited capability to distinguish target objects whose signatures in these bands are similar (T. Zhang et al. Citation2024). Thermal sensors can obtain data for identifying ground objects and inverting surface parameters such as temperature and humidity. Therefore, thermal systems are capable of capturing object changes, providing useful information for specific tasks such as natural disaster detection and plant disease early warning (Su et al. Citation2023). However, thermal sensors are not suitable for large-scale analysis because infrared thermal imaging suffers from blurred edges and a low signal-to-noise ratio (N. Zhang et al. Citation2020).

Figure 3. UAV sensing systems are comprised of UAV platforms, sensing units, and ground control systems.

Table 1. Optical sensing sensors with different spectral wavelengths.

Considering that the spectral data reflected by land objects vary, the introduction of hyperspectral and multispectral cameras can increase data diversity, thus improving remote sensing observation performance. The spectral resolution of hyperspectral images is on the order of 0.01 μm, with hundreds of bands in the visible and near-infrared regions, while that of multispectral images is on the order of 0.1 μm, with only several bands. Compared with multispectral cameras, hyperspectral ones can obtain much more spectral information for accurate image analysis, paving the way for sophisticated tasks such as crop disease prediction and geological hazard assessment. However, hyperspectral cameras are expensive and hence mostly limited to scientific research. As a result, multispectral cameras are currently considered a relatively precise and cost-effective sensing unit (Jang et al. Citation2020; T. Zhang et al. Citation2024).

For space remote sensing, satellites, space stations, and rockets have all been explored as platforms for region-scale sensing image assessment. Since the first Earth Resources Technology Satellite (ERTS) was launched in 1972, satellites have been gradually utilized and have achieved general application in collecting remote sensing data, marking a new era of remote sensing. Orbiting at altitudes of 150 km or more, satellites can obtain the electromagnetic wave information reflected by target objects over long distances, yielding the sensing images and spectral data required for follow-up processing. Band information details for some typical satellite sensors are shown in Table 2. Landsat-1 was first launched in 1972 by NASA with four bands (green, red, and two infrared) at a resolution of 80 m and a revisit time of 18 days (Su et al. Citation2023). Later, in 1984, Landsat-5 was launched with the Thematic Mapper sensor at a spatial resolution of 30 m in blue, green, red, near-infrared, and three infrared (including thermal) bands (Masek et al. Citation2006). It remained widely used in remote sensing for decades afterwards. The launch of Sentinel-2A is a key part of the Global Monitoring for Environment and Security (GMES) program supported by the European Space Agency (ESA) and the European Commission (EC), serving the needs of agriculture, forest monitoring, and natural disaster management (Drusch et al. Citation2012). The most remarkable characteristic of satellite remote sensing images is the rich semantic information provided by their high spatial/spectral resolution at the regional scale, satisfying general needs in land use, Earth mapping, disaster detection, etc. Satellite sensing can be regarded as the best choice for regional-scale mapping research (N. Zhang et al. Citation2020).

Table 2. Band information details of typical satellite sensors.

Comparing the drawbacks of the two popular platforms mentioned above (i.e. UAV-based and satellite-based platforms), UAV-based platforms are much more susceptible to harsh environments and cannot carry large sensors. Conversely, although the single-image coverage collected by satellites is undoubtedly wide, it leads to lower spatial resolutions than remote sensing images from UAV-based platforms. In addition, the huge volume of imaging data makes data transmission between satellites and Earth costly. Currently, interpolation, reconstruction, and other technologies are used to mitigate this problem (Keshk and Yin Citation2017).

Based on the aforementioned platforms and sensors, various remote sensing systems are deployed for specific tasks. In fact, data collection in the remote sensing area generally yields RS image datasets customized for the segmentation target studied. Specifically, various datasets with different spatial and spectral resolutions are integrated with specially designed artificial intelligence algorithms for industry applications. Moreover, in recent years, related institutions have released publicly available RS image segmentation datasets, promoting DL models via high-quality training. We briefly show some common public datasets in Figure 4. For example, the ISPRS Vaihingen and Potsdam images are aerial datasets collected over urban areas. Vaihingen contains three bands (near-infrared, red, and green) while Potsdam is captured in four channels (near-infrared, red, green, and blue). GID is a large-scale dataset constructed for land use and land cover classification. It contains 150 high-quality images collected by Gaofen-2 from over 60 different cities in China, providing satellite images in blue, green, red, and near-infrared bands.

Figure 4. Commonly utilized RS images for segmentation: (a) ISPRS Potsdam images (UAV optical images), (b) ISPRS Vaihingen images (UAV optical images), and (c) Gaofen Image Dataset (GID) (Satellite multispectral images).

Considering the information limitations of sensors that capture images only in RGB bands, multi-spectral/hyper-spectral sensors are widely utilized to attain images with abundant spectral information and have therefore become popular components of RS systems. We describe two such systems as examples below.

First, the unmanned aerial spectral system adopts UAVs equipped with multi-spectral/hyper-spectral sensors to obtain images with high spectral resolution. In addition, since the images are collected at relatively low altitudes, multiple objects in agricultural and urban areas are captured with rich detail for object detection and classification problems. Nevertheless, UAVs are limited in observation view and the landmark elements they can cover within chosen target scenarios. In agricultural sensing, Osco utilized a UAV-mounted multispectral camera with four bands (green, red, red-edge, and near-infrared) to estimate the number and location of citrus trees (Osco et al. Citation2020). In urban areas, the evolution of UAV-based remote sensing efficiently supports urban management while ensuring sustainability (Noor, Abdullah, and Hashim Citation2018). Hartling developed a fusion technique to integrate multi-sensor data, demonstrating its potential for object classification in complex urban environments (Hartling, Sagan, and Maimaitijiang Citation2021). UAV multispectral images have also been applied in water quality monitoring (Y. Xiao et al. Citation2022) and the detection of asphalt pavement potholes and cracks (Y. Pan et al. Citation2018).

Another frequently used system is the spectral satellite system. With large coverage, numerous annotation types, and wide spectral range, the obtained images carry abundant semantic information for analysis and are thus widely utilized in environmental observation such as hydrological and geological applications. However, one critical challenge arises when applying satellite data to deep learning models: the large number of labels required for images with wide coverage leads to high manual annotation costs and rather low utilization of satellite images. Hence, researchers resort to diverse label-efficient methods for satellite image segmentation, which are introduced below. For downstream applications, several methods have been adopted to improve land use and land cover classification performance using multi-spectral satellite imagery (Kim et al. Citation2018; Kupidura Citation2019). In addition, spectral satellite images are applied in marine environmental monitoring, marine pollution detection, ship detection, and so on.

3. Deep learning methods in remote sensing image segmentation

3.1. Challenges of deep learning applied on remote sensing image segmentation

Significant differences exist between remote sensing images and natural images in terms of data acquisition, detail information, and application fields. As mentioned above, RS systems generally contain diverse sensing platforms (aviation platforms, space sensing platforms, etc.) and sensors (optical cameras, hyper/multi-spectral cameras, LiDAR, etc.). These sensing systems collect and record electromagnetic wave information radiated or reflected by objects in the target region. The obtained information is then transformed into visual RS images via electromagnetic wave conversion and recognition. Hence, unlike natural images collected by optical cameras, RS images usually embody plentiful spectral information from the surface, presenting fundamental material composition data for agriculture, geological exploration, and other applications in the realm of Earth observation. Besides, different from natural images, which generally contain objects from daily life, RS images reflect multiple types of land features (e.g. water bodies, vegetation, buildings) under various environments. Consequently, the pre-processing and annotation of RS images are more expensive than for natural images and demand expertise, making the available large-scale labeled datasets of RS scenes much scarcer than those of natural scenes.

Considering the aforementioned differences and the unique characteristics of RS images compared with natural images, semantic segmentation methods for the respective fields differ in data processing, feature extraction, and training paradigm. For example, novel modules and structures are designed in RS image segmentation models to effectively integrate RS data from multiple spectral bands, which is unnecessary in natural image segmentation tasks. Therefore, despite the mature development of image segmentation algorithms in the field of natural images, such advanced DL models cannot be directly applied to RS image segmentation tasks and typically require fresh designs. As a concrete illustration, even reusing a natural-image backbone requires adapting its input layer to the band count of RS data, as sketched below.
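The following is a minimal sketch of this band-adaptation step, assuming a torchvision ResNet-50 and a hypothetical 4-band (NIR-R-G-B) input; the mean-copy initialization of the extra band is one common heuristic, not a prescribed method:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def adapt_first_conv(model: nn.Module, in_channels: int) -> nn.Module:
    """Replace the 3-channel stem of an ImageNet backbone so it accepts
    multispectral input (e.g. a hypothetical 4-band NIR-R-G-B image)."""
    old = model.conv1  # torchvision ResNet stem: Conv2d(3, 64, 7, stride=2)
    new = nn.Conv2d(in_channels, old.out_channels,
                    kernel_size=old.kernel_size, stride=old.stride,
                    padding=old.padding, bias=False)
    with torch.no_grad():
        new.weight[:, :3] = old.weight            # keep RGB filters
        if in_channels > 3:
            # Initialize extra bands with the mean of the RGB filters,
            # a common heuristic that keeps pretrained features usable.
            new.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)
    model.conv1 = new
    return model

backbone = adapt_first_conv(resnet50(weights=None), in_channels=4)
x = torch.randn(1, 4, 224, 224)    # a stand-in 4-band RS patch
print(backbone(x).shape)           # torch.Size([1, 1000])
```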

In summary, applying deep learning to remote sensing image segmentation faces the following difficulties and challenges. First, RS images are generally of high resolution with abundant details, including very high-resolution images and the wide spectrum of multi-spectral/hyper-spectral data. In addition, given the performance bottleneck of single-modality methods in complex RS scenes, multi-modal data resources have also been investigated to resolve the limitation of information diversity. Making full use of the increasing multi-modal Earth observation data, multi-modal methods can attain further performance gains in the semantic segmentation task. Consequently, exploring effective fusion methods and data integration to make full use of these data is one of the significant challenges in improving segmentation performance. Second, unlike the simple object classification of natural images, the annotation of RS images demands much expertise. With limited labeled samples in remote sensing image datasets, the performance of supervised learning is restricted to some extent, strongly impeding the promotion of deep learning in conventional remote sensing tasks.

Considering the aforementioned challenges, various architecture designs and label-efficient approaches are proposed for performance development in semantic segmentation task, which will be introduced in detail in the following sections.

3.2. The various architecture design of deep networks

Supervised networks typically take RS images as inputs and perform a series of operations (e.g. convolution, pooling, and local normalization) to transform them into output semantic labels. As a key underpinning of DL-based segmentation algorithms, the CNN was initially proposed and gradually promoted with the innovations of shared weights and back-propagation training. As a typical example, LeNet-5, built from CNN structures, was introduced for digit recognition, as shown in Figure 5. In general, CNNs contain three types of layers, namely convolutional layers, nonlinear layers, and pooling layers. With all the receptive fields in a layer sharing weights, convolutional layers significantly reduce the number of parameters compared with fully-connected neural networks. Inspired by fully convolutional networks (FCN) (Long, Shelhamer, and Darrell Citation2015), first developed for pixel-level segmentation, numerous deep learning-based segmentation methods have been proposed with different underlying architectural contributions, leading to general applications in the field of remote sensing. These specially designed architectures correlate with model performance and specifically address the challenges of abundant detail and wide spectral range when segmenting RS images using DL algorithms. Considering the architectures of DL-based segmentation works, we summarize them into three types, namely CNN-based, Transformer-based, and hybrid structure-based models.
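To make the parameter economy of weight sharing concrete, a toy calculation with illustrative layer sizes (not drawn from any model in this review):

```python
# Parameter count of one 3x3 conv layer vs. a fully-connected layer
# mapping the same 64x64x3 input to a 64x64x16 output.
k, c_in, c_out = 3, 3, 16
conv_params = k * k * c_in * c_out + c_out                # weights + biases: 448
fc_params = (64 * 64 * c_in) * (64 * 64 * c_out) + 64 * 64 * c_out
print(conv_params)        # 448
print(f"{fc_params:,}")   # 805,371,904 -- weight sharing saves ~6 orders of magnitude
```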

Figure 5. Architecture of convolutional neural networks (LeCun et al. Citation1998).

3.2.1. CNN-based models

One popular family of CNN-based models adopts the convolutional encoder–decoder architecture. The encoder–decoder mode contains a two-stage network, capturing the deep semantic information of the input with the encoder and predicting the output from the latent-space representation with the decoder. SegNet was a classical encoder–decoder structure whose corresponding encoder and decoder share the same spatial size and channel numbers (Badrinarayanan, Kendall, and Cipolla Citation2017). In addition, typical convolutional networks were based on the idea of spatially sharing weights to reduce the parameters and thereby improve training performance, generally consisting of convolutional layers, max-pooling layers, and fully-connected (FC) layers. Since the Deep Residual Network (ResNet) (K. He et al. Citation2016) was proposed as a significant event in the evolution of CNNs, the training difficulty of deep CNN models has been successfully resolved, enabling very deep CNN networks to comprehensively exploit image semantic information. A minimal encoder–decoder segmenter is sketched below.
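The sketch below illustrates only the two-stage down/up-sampling pattern; layer sizes are arbitrary and this is not the published SegNet:

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Minimal encoder-decoder segmenter: the encoder downsamples with
    max-pooling, the decoder restores resolution with transposed convs."""
    def __init__(self, in_ch: int = 3, n_classes: int = 6):
        super().__init__()
        def block(ci, co):
            return nn.Sequential(nn.Conv2d(ci, co, 3, padding=1),
                                 nn.BatchNorm2d(co), nn.ReLU(inplace=True))
        self.enc1, self.enc2 = block(in_ch, 64), block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = block(64, 64)
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.head = nn.Conv2d(32, n_classes, 1)   # 1x1 conv -> label scores

    def forward(self, x):
        x = self.pool(self.enc1(x))       # H/2
        x = self.pool(self.enc2(x))       # H/4
        x = self.dec1(self.up1(x))        # H/2
        return self.head(self.up2(x))     # H, per-pixel class logits

logits = TinyEncoderDecoder()(torch.randn(1, 3, 256, 256))
print(logits.shape)  # torch.Size([1, 6, 256, 256])
```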

Since then, innovation in image segmentation networks has mainly focused on optimizing the encoder and decoder structures for better performance and efficiency. In recent years, deep CNN-based models have achieved great success and wide application in image segmentation tasks across different fields. Some classic CNN-based image segmentation approaches and advanced innovations are summarized in Table 3. Besides, convolution blocks are also adopted to build the sub-networks of multi-modal models. Hong proposed a general multi-modal deep learning framework based on an extraction network and five types of fusion networks to integrate features from hyper-spectral images (HSI), LiDAR images, and SAR images, utilizing diverse information for better RS image segmentation performance (Hong et al. Citation2020). The Deeplab series (e.g. DeeplabV1 L.-C. Chen et al. Citation2014, DeeplabV2 L.-C. Chen, Papandreou, Kokkinos et al. Citation2017, DeeplabV3 L.-C. Chen, Papandreou, Schroff et al. Citation2017, and DeeplabV3+ L.-C. Chen et al. Citation2018) is a continuous evolution and extension based on the technique of dilated convolutions, which introduces a dilation rate parameter to convolutional operations to address the resolution loss caused by max-pooling and striding. Atrous spatial pyramid pooling (ASPP) was adopted in DeeplabV2 to attain multi-scale information for robust segmentation; a simplified version is sketched below. The best DeeplabV3+ model utilized the DeepLabv3 framework as encoder and added an effective decoder module to refine segmentation results. The ablation studies of DeeplabV3+ showed that adopting the proposed decoder module improved mIOU from 77.21% to 78.85%.
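The following simplified ASPP module shows the dilated-convolution idea; the rates echo the Deeplab papers, but the module is illustrative rather than the published implementation:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling sketch: parallel 3x3 convs with
    different dilation rates sample multi-scale context at one resolution."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        # Every branch keeps the spatial size because padding == dilation.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feat = torch.randn(1, 256, 32, 32)    # a stand-in backbone feature map
print(ASPP(256, 128)(feat).shape)     # torch.Size([1, 128, 32, 32])
```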

Table 3. Classic CNN-based models for image segmentation task.

U-Net (Ronneberger, Fischer, and Brox Citation2015) and its variants form another popular family of image segmentation methods. The original U-Net used a symmetric path and skip connections to capture context in a U shape (i.e. first down-sampling when extracting features with convolutions and then up-sampling using deconvolutions), which proved especially suitable for medical image segmentation, as the structure combines original image information from different scales. Considering the complicated spatial/spectral information in RS images, U-Net has also inspired segmentation works in RS and been applied to cope with the abundant-data problems in remote sensing analysis; the skip-connection pattern is sketched below. For example, Diakogiannis introduced residual connections, atrous convolutions, and pyramid scene parsing pooling, enabling ResUNet to extract semantics from high-resolution aerial images and perform well on semantic segmentation (Diakogiannis et al. Citation2020). Other multi-scale models such as HRNet have also been adopted as backbone architectures in RS image segmentation tasks. Hong proposed a high-resolution domain adaptation network, namely HighDAN, for cross-city semantic segmentation (Hong, Zhang, Li, Li, Yao et al. Citation2023). Using a multi-modal high-resolution sub-network, HighDAN learns preliminary representations for different RS modalities.
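A one-level U-shaped network showing the skip-connection pattern (an illustrative sketch with arbitrary widths, not the original U-Net configuration):

```python
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    """One-level U-Net sketch: the decoder concatenates the encoder feature
    map (skip connection) so fine spatial detail survives downsampling."""
    def __init__(self, in_ch: int = 3, n_classes: int = 2):
        super().__init__()
        def conv(ci, co):
            return nn.Sequential(nn.Conv2d(ci, co, 3, padding=1),
                                 nn.ReLU(inplace=True))
        self.enc = conv(in_ch, 64)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = conv(64, 128)
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec = conv(128, 64)          # 128 = 64 (skip) + 64 (upsampled)
        self.head = nn.Conv2d(64, n_classes, 1)

    def forward(self, x):
        e = self.enc(x)                               # H,   64 ch
        b = self.bottleneck(self.down(e))             # H/2, 128 ch
        d = self.dec(torch.cat([e, self.up(b)], 1))   # skip + upsample
        return self.head(d)

print(MiniUNet()(torch.randn(1, 3, 128, 128)).shape)  # [1, 2, 128, 128]
```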

The attention mechanism was brought into computer vision to imitate the human visual system through a dynamic weight adjustment process. Specifically, attention-based models adaptively weight features according to the importance of inputs via attention blocks. SENet pioneered channel attention-based models with its core squeeze-and-excitation (SE) block capturing channel-wise relationships to improve representation ability (Hu, Shen, and Sun Citation2018); a minimal version is sketched below. DANet was a dual attention network designed for scene segmentation (Fu et al. Citation2019). In particular, two types of attention modules appended on top of a dilated FCN were utilized to model semantic interdependencies in the spatial and channel dimensions respectively. The ablation experiments showed that these attention modules improved performance significantly, exceeding the mIOU baseline by 3.3%. With the development of self-attention in computer vision, it is typically utilized as a spatial attention mechanism to capture global information. Besides CCNet (Z. Huang et al. Citation2019) and the Non-local Block (X. Wang et al. Citation2018), which apply self-attention in CNN-based networks for segmentation tasks, pure attention-based networks have proved to achieve better results than convolutional neural networks, especially on large datasets. Such networks are generally summarized as Transformer-based architectures, which are introduced below. In addition, other popular CNN-based architectures are also listed in Table 3 for a more comprehensive illustration of image segmentation.
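The SE block is compact enough to write out in full; this sketch follows the published recipe (global pooling, bottleneck MLP, sigmoid gating) with an arbitrary reduction ratio:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention: squeeze spatial dims to a
    channel descriptor, excite via a gating MLP, then reweight channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))      # squeeze: (B, C) descriptor
        return x * w.view(b, c, 1, 1)        # excite: per-channel reweighting

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```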

In conclusion, based on the convolution operation, CNNs can capture local image features to handle the detailed information in satellite images. With advanced convolution modules and pyramid structures, CNN-based models can expand the receptive field adaptively, realizing multi-scale feature fusion for RS image representation.

3.2.2. Transformer-based models

The local property of CNNs limits their capture of global context and long-range dependencies in images. In addition, convolutional filter weights are stationary for all inputs regardless of their properties. Inspired by the breakthrough of the Transformer (Vaswani et al. Citation2017) in natural language processing (NLP), Transformer-based models have become a hot topic in computer vision, showing impressive performance on many vision-related tasks, particularly image classification and segmentation. As a significant success of Transformer-based models applied to image classification, vision transformers (ViTs) (Dosovitskiy et al. Citation2020) have demonstrated great potential in global information modeling.

As shown in Figure 6, instead of using convolutional structures as CNN-based methods do, ViTs adopt the self-attention mechanism to effectively capture global relationships between patches; the patch-embedding and attention front end is sketched below. According to recent works (Naseer et al. Citation2021; Park and Kim Citation2022), ViTs have demonstrated the capability to flexibly adjust receptive fields and learn effective feature representations (Aleissaee et al. Citation2023). As a result, ViTs have been widely utilized in different fields through a variety of variants. As two key components of Transformer-based models, the self-attention mechanism and multi-head attention are utilized to learn self-alignment and different interactions between embeddings respectively. Considering that inductive biases are also important for image analysis but are not included in the ViT architecture, Swin (Z. Liu et al. Citation2021) was proposed to produce hierarchical feature representations. Another popular Transformer backbone for dense prediction tasks is the pyramid vision transformer (PVT) architecture (W. Wang et al. Citation2021). It utilizes a progressively shrinking pyramid structure and a spatial-reduction attention layer, making it a unified backbone of both CNN and Transformer for producing multi-scale feature maps. Expected to address the representation challenge of long-range dependencies among various bands, Transformer-based algorithms are adopted in remote sensing image segmentation, especially for hyper-spectral/multi-spectral images (W. Wang et al. Citation2022). In particular, these methods suit images with abundant channel data and multi-modal data thanks to their powerful capability of capturing temporal and long-distance information. For example, Yao extended the conventional ViT with minimal modifications and proposed a novel Transformer-based multi-modal deep learning framework to make full use of various kinds of data (e.g. hyper-spectral data, LiDAR data, and SAR data). Multi-modal attention and a token fusion method are specially designed to promote information exchange across modalities (J. Yao et al. Citation2023).
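A sketch of this front end, with patch embedding implemented as a strided convolution and PyTorch's built-in multi-head attention; the dimensions are illustrative, not those of any published ViT variant:

```python
import torch
import torch.nn as nn

class PatchSelfAttention(nn.Module):
    """ViT-style front end: split the image into patches, project each to
    an embedding, and let every patch attend to every other patch."""
    def __init__(self, img=224, patch=16, in_ch=3, dim=256, heads=8):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        n = (img // patch) ** 2                          # number of patches
        self.pos = nn.Parameter(torch.zeros(1, n, dim))  # learned positions
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        t = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # (B, N, D)
        out, weights = self.attn(t, t, t)   # global pairwise interactions
        return out, weights                 # weights: (B, N, N)

tokens, attn = PatchSelfAttention()(torch.randn(1, 3, 224, 224))
print(tokens.shape, attn.shape)  # [1, 196, 256] and [1, 196, 196]
```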

Figure 6. The overview of vision transformer's architecture (Dosovitskiy et al. Citation2020).

A variety of models based on the Swin Transformer have been utilized on public datasets and specific scene targets. For example, the Efficient Transformer, based on a Swin Transformer backbone, was proposed to improve segmentation efficiency, with edge enhancement methods adopted to cope with inaccurate edge problems (Z. Xu et al. Citation2021). Wang introduced the Swin Transformer as a backbone and proposed a densely connected feature aggregation module (DCFAM) to model the relationships among multi-scale semantic features for precise segmentation (L. Wang, Li, Duan et al. Citation2022). Ablation studies on the Potsdam and Vaihingen datasets showed that introducing DCFAM yielded increases of 4.05% and 3% respectively. Besides, a novel decoder was designed to densely connect and aggregate features, restoring the resolution for segmentation prediction. Notably, when an RS image provides limited semantics due to constrained spatial resolution, image super-resolution methods, such as DCNet (Hong, Yao et al. Citation2023), can be adopted to improve data quality. For approaches based on Transformer structures for RGB image segmentation and hyper-spectral image segmentation, a performance comparison is conducted in Table 4.

Table 4. Performance comparison of different semantic segmentation methods based on Transformer structures on RS image datasets in terms of overall accuracy (OA).

3.2.3. Hybrid structure-based models

Much research has utilized the merits of CNNs and Transformers in hybrid structures to improve segmentation performance, taking advantage of the strong information extraction capabilities of CNN-based models and the powerful global information modeling of Transformer-based methods. The general frameworks can be categorized into two architectures: integrating CNN and Transformer in a single branch, or in dual-flow branches. For example, comparing the model structures shown in Figure 7, the hyperspectral image transformer (HiT) classification network (see Figure 7(a)) integrates convolution operations with the Transformer structure to capture spectral discrepancies and spatial contexts in a single branch (Yang et al. Citation2022). However, it is considered that the abundant band information of HSI cannot be sufficiently interpreted by a single-branch structure. Specifically, the local semantics and global interactions captured by one branch are not complementary enough for accurate class discrimination. Therefore, a two-branch structure named Hyper-LGNet (see Figure 7(b)), utilizing CNN and Transformer models in a dual-flow framework, was proposed to obtain HSI spatial/spectral feature representations for classification problems (T. Zhang et al. Citation2022).

Figure 7. Two types of hybrid structure-based models using CNNs and transformers: (a) the framework of HiT integrating CNN and Transformer by hybrid strategy in a single branch (Yang et al. Citation2022) and (b) the framework of Hyper-LGNet integrating CNN and Transformer in a dual flow framework (T. Zhang et al. Citation2022).

In addition, STransFuse employed a staged model to extract feature representations at various semantic scales, and then fused the information from different stages with an adaptive fusion module (AFM) employing the self-attentive mechanism (Gao et al. Citation2021). Ablation studies demonstrated that employing AFM for feature map fusion performed 0.24% better in OA on the Potsdam dataset than the baseline model. CCTNet combined local details (e.g. edge and texture) and global context through a novel architecture coupling CNN and Transformer networks (H. Wang et al. Citation2022). UNetFormer is a scene segmentation model constructed for real-time urban scenarios, which maintains top performance on the Potsdam and Vaihingen datasets (L. Wang, Li, Zhang et al. Citation2022). By selecting the lightweight ResNet18 as encoder and developing an efficient global–local attention mechanism as a Transformer-based decoder, UNetFormer models both global and local information at the same time. Conversely, another hybrid network adopted a Swin Transformer backbone to capture long-range dependencies, with a U-shaped decoder employing an ASPP block along with an SE block to preserve local image details (C. Zhang et al. Citation2022). The dual-flow fusion pattern common to these designs is sketched below.
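The sketch below fuses a CNN branch and a Transformer branch by concatenation; it illustrates the general dual-flow pattern only, not any specific published network:

```python
import torch
import torch.nn as nn

class DualBranchSegmenter(nn.Module):
    """Dual-flow hybrid sketch: a CNN branch for local texture/edges and a
    Transformer branch for global context, fused by concatenation."""
    def __init__(self, in_ch=3, dim=64, n_classes=6, patch=8):
        super().__init__()
        self.cnn = nn.Sequential(                        # local features
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True))
        self.patchify = nn.Conv2d(in_ch, dim, patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Conv2d(2 * dim, n_classes, 1)     # fuse + classify

    def forward(self, x):
        local = self.cnn(x)                                   # (B, D, H, W)
        t = self.patchify(x)                                  # (B, D, h, w)
        b, d, h, w = t.shape
        g = self.transformer(t.flatten(2).transpose(1, 2))    # (B, hw, D)
        g = g.transpose(1, 2).reshape(b, d, h, w)
        g = nn.functional.interpolate(g, size=local.shape[2:])  # back to H, W
        return self.head(torch.cat([local, g], dim=1))

print(DualBranchSegmenter()(torch.randn(1, 3, 128, 128)).shape)
# torch.Size([1, 6, 128, 128])
```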

3.2.4. Computational complexity in different models

Given that computational complexity measures the computational resources required for training and inference, it is one of the important factors in model structure design. Generally related to the number of learnable parameters, the size of the input data, and the calculation operations of different layers, computational complexity varies across RS image segmentation models and significantly impacts their applicability to downstream tasks. For measuring computational complexity, FLOPs (floating point operations) are a common index, counting the floating-point operations performed during forward propagation; higher FLOPs mean more computational resources are required. The computational complexity of some typical methods is compared via the FLOPs index in Table 5. Since different blocks and modules (e.g. CNN blocks and multi-head attention blocks) generally possess different FLOPs, careful design is necessary for DL-based RS image segmentation structures to balance high precision against model lightness. The convolutional case of this count is worked below.
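As a worked example for a single convolution layer (note that conventions differ on whether a multiply-add counts as one operation, i.e. MACs, or two, i.e. FLOPs):

```python
def conv2d_flops(h_out, w_out, c_in, c_out, k, macs_as_two=True):
    """FLOPs of one 2-D convolution layer: every output element needs
    k*k*c_in multiply-accumulates."""
    macs = h_out * w_out * c_out * k * k * c_in
    return 2 * macs if macs_as_two else macs

# A 3x3 conv, 64 -> 64 channels, on a 256x256 feature map:
print(f"{conv2d_flops(256, 256, 64, 64, 3) / 1e9:.2f} GFLOPs")  # 4.83 GFLOPs
```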

Table 5. Computational complexity of some RS image segmentation models.

3.3. Label-efficient methods for segmentation model

Beyond the aforementioned supervised approaches and model innovations, selecting the optimal paradigm and training techniques also has a crucial impact on feature extraction and subsequent classification accuracy. In fact, though generally adopted with good results, prevailing methods utilizing CNN-based or Transformer-based architectures are sensitive to the quantity and composition of datasets. However, the large amount of per-pixel annotation required by segmentation algorithms is expensive and labor-consuming, especially for satellite images. Under this circumstance, DL algorithms have in recent years been trained via label-efficient learning processes to alleviate annotation costs. By exploring the potential of unlabeled and weakly labeled data, such methods are promising techniques for handling the label shortage challenge in sensing image segmentation. According to the type of supervision and the usage of annotations, label-efficient methods can be categorized from three common perspectives, namely transfer learning, semi-supervised learning, and self-supervised learning (Bandara, Nair, and Patel Citation2022; Kolbeinsson and Mikolajczyk Citation2022; Vincenzi et al. Citation2021; D. Zhang, Liu, and Shi Citation2020).

3.3.1. Transfer learning

Following the popular paradigm of pre-training on standard large-scale datasets and then fine-tuning for downstream tasks without abundant annotations, transfer learning utilizes feature representations from models that have already been trained. Such a method can decrease training time and optimize the learning efficiency of segmentation models. In an early study of CNNs for remote sensing image classification, Castelluccio compared the performance of three learning approaches, namely using pre-trained CNNs as feature extractors, fine-tuning, and training from scratch (Castelluccio et al. Citation2015). The results demonstrated that fine-tuning outperforms full training when the dataset is small. Three common styles of fine-tuning exist: training all parameters, training only the last few layers, and training additional fully connected layers after loading pre-trained weights.

Therefore, introducing the transfer learning method into the remote sensing field, Zhang employed EfficientNet pre-trained on ImageNet as the backbone in the initial stage, and fine-tuned the lowest fully connected layer in the second stage (D. Zhang, Liu, and Shi Citation2020). Specifically, to fit the dimensions of RS image datasets, the original softmax classification layer in EfficientNet was replaced by an X-dimensional softmax classification layer. The idea of domain-specific transfer learning was also introduced by Panboonyuen to reuse weights, performing knowledge transfer between RS datasets of different resolutions (Panboonyuen et al. Citation2019). However, the domain difference (e.g. data properties and image pre-processing) between source and target data must be considered when adopting the transfer learning method, as it greatly affects the performance of the supervised training process. Generally, pre-trained models perform better on similar datasets than on different ones. The head-replacement style of fine-tuning is sketched below.
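A sketch of head replacement with a torchvision ResNet-18; the 6-class head is a hypothetical land-cover setup, not taken from the cited works:

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# Load ImageNet-pretrained weights, freeze the backbone, and replace the
# final layer with one sized for the RS dataset's classes -- the
# "train only the new head" style of fine-tuning described above.
model = resnet18(weights=ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                     # keep pretrained features fixed
model.fc = nn.Linear(model.fc.in_features, 6)   # e.g. 6 land-cover classes

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))        # only the new head trains: 3078
```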

3.3.2. Semi-supervised learning

Aiming to make full use of large amounts of unlabeled data for model improvement, semi-supervised learning is especially suitable for datasets with limited labels. Following previous works (Y. Chen et al. Citation2024; Jiao et al. Citation2022), we divide the most representative semi-supervised image segmentation methods into two categories, namely self-training and consistency learning. Self-training methods first train segmentation models with labeled data and then use them to generate pseudo labels for unlabeled data. By incorporating the pseudo labels into the dataset for the next training round, self-training iterates and improves the model continuously; one such round is sketched below. By learning classifiers on different views of the data sources, co-training provides another self-training mechanism. On the other hand, despite variations among prevalent methods, the basic principle of consistency learning is to enforce consistent model outputs for the same sample under different transformations. Classic practices of consistency learning include the mean teacher (Tarvainen and Valpola Citation2017), PseudoSeg (Y. Zou et al. Citation2020), and graph-based regularization (Kipf and Welling Citation2016).
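One self-training round might look like the following sketch, where the model, optimizer, and the two loaders are assumed to exist; low-confidence pixels are simply ignored in the pseudo-label loss:

```python
import torch
import torch.nn.functional as F

def self_training_round(model, optimizer, labeled_loader, unlabeled_loader,
                        threshold=0.9):
    """One round of self-training for segmentation (a sketch): confident
    predictions on unlabeled images become pseudo labels that are trained
    on alongside the ground-truth labels."""
    model.train()
    for (x_l, y_l), x_u in zip(labeled_loader, unlabeled_loader):
        with torch.no_grad():
            probs = F.softmax(model(x_u), dim=1)     # (B, C, H, W)
            conf, pseudo = probs.max(dim=1)          # per-pixel hard labels
            pseudo[conf < threshold] = -1            # drop unsure pixels
        loss = (F.cross_entropy(model(x_l), y_l) +
                F.cross_entropy(model(x_u), pseudo, ignore_index=-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```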

For the remote sensing image segmentation task, densely annotating each image into various categories is time-consuming and costly. By contrast, acquiring the original images is relatively simple with convenient platforms and sensors, resulting in low utilization of collected data, especially for satellite datasets. Therefore, remote sensing image segmentation models have been further explored in combination with the semi-supervised learning paradigm. For example, consistency regularization and averaged updating of pseudo labels were integrated to generate pseudo labels, so that a large number of pseudo labels and limited strong labels could be fed into RS semantic segmentation models for training (J. Wang et al. Citation2020). Wang combined the semi-supervised method with an iterative contrastive network, realizing the learning of potential information in pseudo RS labels (J.-X. Wang et al. Citation2022).

3.3.3. Self-supervised learning

Recently, by learning representative features from unlabeled data, self-supervised learning has achieved performance comparable to supervised pre-training on many tasks with less supervision (e.g. manual annotations). The general paradigm of self-supervised learning is illustrated in Figure 8. Trained with specially designed self-supervised pretext tasks, models are expected to capture high-level representations of unlabeled samples for supervised downstream tasks. Hence, this paradigm has become helpful for model training when samples are unlabeled. The most commonly used pretext tasks fall into three types, namely reconstruction via generative models, prediction of self-produced labels from data augmentation, and contrastive learning (Y. Wang et al. Citation2022).

Figure 8. The general paradigm of self-supervised learning and supervised learning: (a) Supervised Pipeline and (b) Self-Supervised Pipeline.

Generative pre-training aims at learning representations of input images via a reconstruction or generation procedure, thereby enhancing the feature extraction capability of semantic segmentation models. Two prominent generative models are generative adversarial networks (GANs) and autoencoders. Beyond these, recent works have demonstrated the potential of a cutting-edge category of generative models for image segmentation pre-training, namely denoising diffusion probabilistic models (DDPMs) (Baranchuk et al. Citation2021; Graikos et al. Citation2022). Moreover, the feasibility of directly generating high-quality segmentation maps of input images via DDPMs has been demonstrated in different fields of segmentation, including medical images (Wolleb et al. Citation2022; Wu et al. Citation2022, Citation2023) and remote sensing images (Bandara, Nair, and Patel Citation2022; Kolbeinsson and Mikolajczyk Citation2022). Different from generative models, predictive self-supervised methods generally rely on labels produced in different pretext tasks. A variety of pretext tasks have been designed to acquire context information from the input data (e.g. relative position prediction Doersch, Gupta, and Efros Citation2015, Jigsaw puzzles Noroozi and Favaro Citation2016, missing patch reconstruction K. He, Chen et al. Citation2022). However, the quality of pretext tasks in predictive self-supervised learning directly affects learning performance. To achieve a more general design, contrastive learning methods were proposed as an effective paradigm for learning high-level representations from unlabeled samples. As a significant milestone, SimCLR, proposed by Chen, was a simple framework applying contrastive learning to visual representations (T. Chen et al. Citation2020); its loss is sketched below. After that, numerous architectures were proposed to achieve better self-supervised learning performance (Caron et al. Citation2020, Citation2021; T. Chen et al. Citation2020; Grill et al. Citation2020; K. He et al. Citation2020).
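The NT-Xent loss at the heart of SimCLR can be written compactly (a reimplementation sketch of the published loss, with an illustrative temperature):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """SimCLR-style contrastive (NT-Xent) loss for two augmented views of
    the same batch: each embedding must pick out its counterpart among the
    2N-1 other embeddings. z1, z2: (N, D) projection-head outputs."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2]), dim=1)        # (2N, D)
    sim = z @ z.t() / tau                              # scaled cosine sims
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)               # positive = other view

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent(z1, z2).item())   # scalar; decreases as paired views agree
```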

For remote sensing tasks, autoencoders have been widely adopted to learn representations from various remote sensing data, such as multi-spectral images, hyper-spectral images, and SAR images, while GAN-based methods are rarely used for self-supervised pre-training (Y. Wang et al. Citation2022). For example, a fully Conv–Deconv network was proposed to learn spectral–spatial features from hyperspectral images via reproduction of the initial data (Mou, Ghamisi, and Xiang Zhu Citation2017). As powerful generative models, DDPMs have proved to learn useful features from RS images containing high-level semantic information for downstream tasks (N. Chen et al. Citation2023) and have also been utilized to generate segmentation masks directly (Kolbeinsson and Mikolajczyk Citation2022). Integrating predictive self-supervised methods into the remote sensing area, Zhao proposed a multi-task learning framework combining rotation recognition and scene classification to enhance the generalization capability of CNN models (Z. Zhao et al. Citation2020). Notably, for RS images with abundant spectral data, pretext tasks require specific designs to relate the spectral bands. For example, Vincenzi exploited the potential of a colorization pretext task on satellite imagery: by leveraging the high-dimensional spectral bands to reconstruct the visible colors, meaningful representations could be learned (Vincenzi et al. Citation2021). At the same time, contrastive loss is also widely used in remote sensing image segmentation. Hou utilized a two-stage hyper-spectral imagery classification training strategy based on contrastive learning to make the most of unlabeled samples, alleviating the problem of insufficient label information in hyper-spectral data (Hou et al. Citation2021).

3.3.4. Foundation models

Vision foundation models have attracted increasing interest in recent years. Such general deep learning models are usually trained on large volumes of data with substantial computing resources. Besides, the great innovations in training paradigms (e.g. the aforementioned self-supervised methods) underscore the potential of extracting deep semantics from large-scale unlabeled datasets, yielding powerful foundation models with rich feature representations. By learning from an enormous number of images, visual foundation models are expected to capture general patterns and prior knowledge, demonstrating strong generalization ability across multiple tasks and scenarios. Therefore, such foundation models enable customized fine-tuning of the base structures and achieve promising generalization and sustainable improvement on a wide range of downstream applications.

Some notable frameworks have been released in recent years in the realm of natural image segmentation. As a remarkable milestone in visual representation learning, SAM achieved impressive zero-shot performance and inspired numerous studies exploiting its potential on a variety of downstream tasks (Kirillov et al. Citation2023). Its design around promptable segmentation tasks endows it with the ability to respond appropriately to any prompt at inference time, leading to unprecedented success of the foundation model pre-training mechanism; a usage sketch follows below. In addition, SEEM was proposed as a promptable and interactive model for segmenting everything everywhere all at once in an image, enabling diverse prompting tasks and going beyond the limited interaction types of SAM. With a unified encoder that maps all visual and linguistic hints into a joint representation space, SEEM supports generic segmentation tasks (X. Zou et al. Citation2023).
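A usage sketch with the released segment_anything package; the checkpoint path and the zero-filled stand-in image are placeholders for a downloaded model file and a real RS tile:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Promptable segmentation with a single foreground point prompt.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)   # stand-in RGB tile
predictor.set_image(image)                        # one-time image embedding
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),          # a single point prompt
    point_labels=np.array([1]),                   # 1 = foreground click
    multimask_output=True)                        # several candidate masks
print(masks.shape, scores.shape)                  # (3, 512, 512) (3,)
```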

For remote sensing images, there are also a variety of foundation models built on abundant unlabeled RS data sources and innovative self-supervised paradigms, supporting effective pre-training for various types of segmentation applications by specific stakeholders. For example, considering the domain gap and the limited generalization capability of models between natural and RS scenes, Sun proposed an RS foundation model framework called RingMo trained with the masked image modeling method. Two classical vision Transformer architectures (ViT and Swin Transformer) were taken as the base structures for the reconstruction task, and the learned encoder proved useful for various optical RS downstream tasks. A total of 2M images were utilized by RingMo to learn a more robust feature representation (X. Sun et al. Citation2023). Focusing on the research gap in spectral data, Hong created a general RS foundation model named SpectralGPT to handle spectral RS images (Hong, Zhang, Li, Li, Li et al. Citation2023). SpectralGPT was trained on a comprehensive dataset comprising over 1M spectral images from the Sentinel-2 satellite. With the combination of a 3D generative pretrained transformer and a multi-target reconstruction framework, SpectralGPT could capture spectrally sequential patterns and show substantial potential in downstream segmentation tasks.
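The masked-image-modeling recipe behind such pre-training can be summarized in a few lines: mask a large fraction of patch tokens at random, reconstruct the missing content, and penalize errors only at masked positions. The sketch below follows the generic masked-autoencoder formulation; it is not the exact RingMo or SpectralGPT implementation, and the mask ratio is an assumed value.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch embeddings; return kept tokens and mask.

    patches: (B, L, D) sequence of patch embeddings.
    """
    B, L, D = patches.shape
    num_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(B, L)                    # random score per patch
    ids_shuffle = noise.argsort(dim=1)          # lowest scores are kept
    ids_keep = ids_shuffle[:, :num_keep]
    kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, L)                     # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0)
    return kept, mask

def mim_loss(pred, target, mask):
    """Mean squared reconstruction error on masked patches only.

    pred, target: (B, L, D) reconstructed and original patch pixels.
    mask: (B, L) with 1 at masked positions.
    """
    per_patch = ((pred - target) ** 2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum()
```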

4. Extensive applications of remote sensing image segmentation methods

Remote sensing image segmentation techniques have gained increasing attention in various applications recently. While similar algorithms are adopted for the segmentation task itself, follow-up decision-making is specially conducted in diverse domains when implementing the algorithms in downstream tasks. In this section, we select three main applications, namely geological applications, precision agriculture, and hydrological applications, together with one potential application, environmental protection, for review. From a high-level perspective, two typical tasks in geological applications are introduced first, namely land use/land cover classification and geological hazard assessment. Precision agriculture is among the applications of greatest interest due to its wide use and promising performance; combining DL methods with remote sensing images, it has been attracting attention in recent years, with crop area classification and crop disease/pest detection regarded as two popular downstream tasks. Furthermore, water body mapping and extraction is a crucial study in hydrology, where satellites and airborne systems are especially utilized for large-scale water assessment. Finally, as a significant task emphasized recently, environmental protection, especially carbon dioxide monitoring, has been integrated with DL-based segmentation models to realize efficient monitoring. The success of these four domains and other unmentioned applications has shown promising prospects for DL image segmentation methods applied to RS images, and the organization of these four applications is illustrated in Figure 9.

Figure 9. The tree diagram of the applications of DL-based image segmentation methods.

4.1. Geological applications

Remote sensing technologies and deep learning methods have enabled a better understanding of geological information, which is beneficial for the Earth's environmental protection and land management.

The land use and land cover classification (LULC) task aims at conducting quantitative assessment of land element variations on the Earth's surface, which are generally brought about by environmental changes and human activities (Digra, Dhir, and Sharma Citation2022). It has been considered one of the most efficient ways to explore land transformation. Since these changes can influence human behavior and may lead to unpredictable geological hazards, timely and precise exploration is of paramount importance for reflecting changes in human society and the environment (Z. Ma and Mei Citation2021). Therefore, many modeling approaches have been proposed to cope with this task, including machine learning and deep learning methods; a diagram is given in Figure 10 to display the general evaluation of typical methods for the LULC task.

Figure 10. Diagram displaying the main approaches and general evaluation for LULC task.

The overall framework of the LULC task is presented in Figure 11. As the spatial/spectral resolution of remote sensing sensors advances, various remote sensing platforms have become competent for LULC surveys with different scale requirements. For instance, multi-spectral/hyper-spectral images collected by satellites offer multi-temporal availability with large spatial coverage and abundant spectral resolution, becoming popular options for large-scale land cover mapping. After a feature engineering phase that removes redundant information from the input data, different machine learning classifiers are trained with the refined data to achieve precise land cover classification. Popular algorithms include the support vector machine (SVM) (Jain et al. Citation2018), random forest (RF) (Shi and Yang Citation2016), the K-means algorithm (F. Zhang et al. Citation2016), and so on. As researched and summarized previously, SVM and RF performed better on LULC classification than other machine learning techniques (L. Ma et al. Citation2017; Talukdar et al. Citation2020). In experiments conducted in previous works, RF obtained the most precise LULC map on satellite images among six examined algorithms (Talukdar et al. Citation2020).

Figure 11. The overall framework of LULC task.
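As a minimal example of the machine-learning branch of this pipeline, the scikit-learn sketch below trains a random forest on per-pixel spectral samples and then classifies a whole scene; the band values and labels are synthetic stand-ins for a labeled LULC dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 labeled pixels x 6 spectral bands, 5 LULC classes.
rng = np.random.default_rng(0)
X = rng.random((10_000, 6))               # per-pixel band reflectances
y = rng.integers(0, 5, size=10_000)       # LULC labels (e.g. water, crop)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
clf.fit(X_train, y_train)
print("overall accuracy:", clf.score(X_test, y_test))

# Classify a whole scene: reshape (H, W, bands) -> (H*W, bands) and back.
scene = rng.random((256, 256, 6))
lulc_map = clf.predict(scene.reshape(-1, 6)).reshape(256, 256)
```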

Recently, the improvement of deep learning methods and the expansion of remote sensing datasets have enabled well-performing deep learning models to achieve more precise LULC results than machine learning methods. Specifically, various works adopt deep learning models to realize pixel-based classification (e.g. road, grass, built-up area) of remote sensing images. Kussul conducted one of the first attempts to design a multilevel DL architecture for land cover classification using multi-temporal multi-source satellite imagery; by building a hierarchy of local features, CNNs were demonstrated to perform better than MLP and RF (Kussul et al. Citation2017). To extract local and global context information in remote sensing images, Xu proposed a high-resolution context extraction network (HRCNet) (Z. Xu et al. Citation2020). Considering the lack of labeled datasets in remote sensing, semi-supervised approaches have also been adopted with pseudo-label generation and model pre-training methods (L. Li et al. Citation2023).
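A minimal form of such pseudo-label generation is sketched below: a model trained on the labeled subset predicts on unlabeled images, and only high-confidence pixels are retained as training targets. The confidence threshold and ignore index are illustrative choices, not the settings of the cited work.

```python
import torch

IGNORE_INDEX = 255          # conventional "unlabeled" value for CE losses

@torch.no_grad()
def generate_pseudo_labels(model, images, threshold=0.9):
    """Predict on unlabeled images and keep only confident pixels.

    images: (B, C, H, W) tensor; returns (B, H, W) integer labels with
    low-confidence pixels set to IGNORE_INDEX.
    """
    model.eval()
    probs = torch.softmax(model(images), dim=1)   # (B, K, H, W)
    conf, labels = probs.max(dim=1)               # per-pixel confidence
    labels[conf < threshold] = IGNORE_INDEX       # drop uncertain pixels
    return labels

# The pseudo-labeled pairs are then mixed into supervised training, e.g. with
# criterion = torch.nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX).
```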

At the same time, the great potential of hyper-spectral/multi-spectral images has been exploited by research combining them with DL-based methods. These data contain abundant information with two spatial dimensions (image width and height) and one spectral dimension (the number of bands) (L. Zhang et al. Citation2012). In particular, the spatial domain captures morphological information, while the spectral domain enables the distinction of different materials at each pixel of the image. Chen first explored the potential of deep learning for hyper-spectral feature extraction, using single-layer and stacked multi-layer autoencoders to learn shallow and deep features of hyper-spectral data, respectively (Y. Chen et al. Citation2014). Considering the multi-band nature of hyper-spectral images, DL-based feature selection approaches are also utilized: by selecting relevant attributes of the data after removing redundant spectral channels, feature selection is widely used in processing massive high-dimensional remote sensing data. Feng proposed a convolutional neural network based on bandwise-independent convolution and hard thresholding (J. Feng et al. Citation2020); it is an end-to-end trainable network combining band selection, feature extraction, and classification. Based on the attention mechanism, He jointly extracted local/global information from both spatial and spectral aspects (K. He, Sun et al. Citation2022); besides, an evaluation index combining information entropy and correlation coefficients was designed to select the appropriate spectral bands. However, there are still problems in recent research on feature selection for hyper-spectral images. Although feature selection maps the original spectrum to a low-dimensional space, interference and redundant spectra still affect the mapping process. At the same time, processing with complex DL algorithms leads to the loss of the original spectral forms and patterns, making it difficult to seek physical explanations of the images.
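A per-pixel spectral autoencoder of the kind pioneered in this line of work can be sketched in a few lines of PyTorch; the band count and layer sizes below are illustrative, not the configuration of Chen et al.

```python
import torch.nn as nn

class SpectralAutoencoder(nn.Module):
    """Compress a per-pixel spectrum into a low-dimensional deep feature."""

    def __init__(self, num_bands=200, hidden=64, code=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(num_bands, hidden), nn.ReLU(),
            nn.Linear(hidden, code),               # learned spectral feature
        )
        self.decoder = nn.Sequential(
            nn.Linear(code, hidden), nn.ReLU(),
            nn.Linear(hidden, num_bands),          # reconstructed spectrum
        )

    def forward(self, x):                          # x: (B, num_bands)
        z = self.encoder(x)
        return self.decoder(z), z

# Trained with an MSE loss between input and reconstruction; the learned
# code z is then fed per pixel to a downstream classifier.
```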

Geological hazard assessment is another significant application in the geological domain. Geological hazards are generally identified as abnormal activities of the geological environment that could result in serious disorder affecting the development of human society. Typical hazards include earthquakes, volcanic activity, and related geophysical processes, among which landslides are the most common (Z. Ma and Mei Citation2021). Since early landslide investigation was conducted through visual interpretation relying on manual expert knowledge and experimental surveys, it was undoubtedly time-consuming and costly. Therefore, automatic methods based on DL algorithms are desirable, exploring the accurate identification of landslide locations from RS images for practical applications. For example, to extract potential active landslides in Interferometric Synthetic Aperture Radar (InSAR) imagery, Ma designed an end-to-end segmentation network called deep residual shrinkage U-Net (DRs-UNet) (Z. Ma and Mei Citation2021). The proposed network was applied to detect landslides in a test area located in Zhongxinrong County along the Jinsha River, and the experimental results demonstrated its efficiency in automatically recognizing potential landslide hazards. In addition, regarding the high cost of manual labeling required by supervised segmentation methods, Zhou researched a weakly supervised learning approach (Y. Zhou et al. Citation2022); with the combination of class activation maps and a cycle generative adversarial network, it could segment RS landslide images precisely. Furthermore, Jiang presented two methods, namely mask region-based convolutional neural networks and transfer learning Mask R-CNN, to detect and segment new and old landslides along the Sichuan-Tibet Transportation Corridor with the constructed datasets (Jiang et al. Citation2022). For other types of hazard assessment, Guerrero used visible high-resolution images combined with CNN-based models for volcanic monitoring focused on ash plume segmentation, measuring the height of the plume to understand the magnitude of the explosion (Guerrero Tello et al. Citation2022). Moreover, based on UAV images, Xiong utilized CNN-based networks for building seismic damage assessment to improve prediction accuracy for effective emergency response after earthquakes (Xiong, Li, and Lu Citation2020).

4.2. Precision agriculture

By optimizing the amounts of agricultural inputs, precision agriculture (PA) is a critical component of sustainable agricultural systems, aiming at improving productivity and increasing crop yields. In recent years, the development of artificial intelligence technologies has enabled agricultural producers to analyze and predict various factors so as to make accurate management decisions in the agricultural production process. Combined with massive data, UAV devices, and remote sensing technology, DL-based systems can collect diverse agricultural data, including meteorological data, soil data, and crop growth data. Taking advantage of rich data and powerful sensing capability, a comprehensive analysis can be realized to assist in crop field monitoring, thereby improving production efficiency. At the same time, DL-based models are also capable of crop disease/pest detection, so that agricultural stakeholders can maintain a timely understanding of crop growth status, soil conditions, and potential risks of pests and diseases. In summary, a series of applications are commonly conducted in PA, including crop area classification, crop vitality and water stress assessment, and crop disease detection.

Crop area classification is first conducted to obtain a timely and accurate spatial distribution of crop areas. Considering the large number of crop types and the significant changes in agricultural crop lands, land resource monitoring is undoubtedly necessary. Recently, DL-based semantic segmentation methods have been combined with remote sensing images to identify different crop types, thereby achieving the management of land resources in designated areas to ensure the development of agricultural production. For example, the precise identification of crop areas in images from drone RS systems supports decision-making for drone seeding, crop fertilization, irrigation, and pesticide spraying operations, achieving the goals of reasonable planning for precision agriculture. Kussul proposed a multilevel DL architecture to classify crop types from satellite images, obtaining an accuracy of more than 85% on target crops including wheat, maize, sunflower, soybeans, and sugar beet (Kussul et al. Citation2017). Wang developed a segmentation approach based on the UNet++ architecture to classify ten categories on Sentinel-2A images with 10 m resolution from 2019 to 2021 (L. Wang, Wang et al. Citation2022). A transfer learning approach was introduced by Viskovic to alleviate the limited number of high-resolution hyperspectral annotations (Viskovic, Kosovic, and Mastelic Citation2019).

After the classification of crop types, indexes reflecting crop vitality and water stress are expected to be obtained for further understanding. Specifically, the recent development of PA has led to increasing demand for the rapid recognition of crop vitality indexes, which serve as powerful technical support for yield prediction, disaster assessment, and other related works. Besides, accurate and efficient recognition of crop water stress can maximize the efficiency of agricultural irrigation. By calculating the temperature difference between the crop canopy and the atmosphere, the Crop Water Stress Index (CWSI) effectively indicates the water stress status of crops. It is of great significance to construct CWSI-based models for real-time monitoring of crop water conditions, thereby providing precise guidance for crop irrigation. Combining UAV remote sensing platforms with thermal infrared sensors is a popular approach to obtain regional surface temperature, hence determining regional water conditions and crop water stress.
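In its widely used empirical form, the index normalizes the canopy temperature between a well-watered (wet) and a non-transpiring (dry) baseline, CWSI = (Tc − Twet)/(Tdry − Twet). A minimal computation over a UAV thermal tile might look as follows; the baseline temperatures and pixel values are assumed inputs.

```python
import numpy as np

def cwsi(t_canopy, t_wet, t_dry):
    """Empirical Crop Water Stress Index, clipped to [0, 1].

    0 = well-watered canopy, 1 = fully stressed (non-transpiring).
    t_canopy: canopy temperatures from a thermal band (deg C), array-like.
    t_wet, t_dry: wet/dry baseline temperatures for the same conditions.
    """
    index = (np.asarray(t_canopy, dtype=float) - t_wet) / (t_dry - t_wet)
    return np.clip(index, 0.0, 1.0)

# Example with assumed canopy temperatures and baselines.
thermal = np.array([[24.1, 26.8], [29.5, 31.2]])
print(cwsi(thermal, t_wet=22.0, t_dry=34.0))
```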

With a relatively comprehensive understanding of crop areas and crop types, the obtained information can support further PA applications. Crop disease and pest damage is one of the most severe challenges to global food security, thus requiring accurate detection to reduce its spread for precision agriculture management. However, great difficulties inevitably appear when detecting plant diseases and pests in complex natural environments. For instance, the differences between lesion areas and the background are generally tiny, while the scales and characteristics vary among different types of lesion areas. Besides, inevitable noise appears in RS images (J. Liu and Wang Citation2021), making it difficult for computers to describe the symptoms of diseases. Therefore, crop disease and pest damage was previously detected by manual processes rather than via mathematical calculation. Recently, owing to their strong capability of feature extraction, DL-based segmentation methods have been increasingly applied in the field of crop disease monitoring (T. Zhang et al. Citation2021). By effectively recognizing different characteristics of plant diseases and pests, DL algorithms can directly achieve efficient detection of various diseases at diverse scales, as presented in Figure 12 (e.g. wheat stripe rust, wheat yellow rust, tomato disease, corn leaf disease, etc.). Besides, related research has demonstrated that accurate disease/pest diagnosis results and advanced prevention can be provided by using image recognition technology. As an example, DL-based methods have been proposed to detect wheat yellow rust disease. Utilizing UAV multi-spectral images, Zhang proposed a real-time yellow rust detection algorithm named Efficient Dual Flow UNet (DF-UNet) to detect different levels of yellow rust, meeting practical requirements for a fast and robust algorithm (T. Zhang et al. Citation2022). Su exploited aerial visual perception for yellow rust disease monitoring by integrating unmanned aerial vehicle sensing, multispectral imaging, and the deep learning U-Net (Su et al. Citation2021). Abbas trained a DenseNet121 model with a transfer learning method for tomato disease detection (Abbas et al. Citation2021). Based on the PSPNet semantic segmentation model, Pan identified wheat yellow rust from unmanned aerial vehicle images and experimentally compared the predicted performance with other machine learning and deep learning methods (Q. Pan et al. Citation2021).

Figure 12. Some applications of crop diseases recognition in RS images.

4.3. Hydrological applications

Survey data show that only about 2.5% of the total amount of water on the Earth is freshwater, of which only about 1.5% can be accessed for various biophysical processes (Chawla, Karthikeyan, and Mishra Citation2020). Therefore, the intensive study and sustainable management of water resources are vitally important for ecological health. Remote sensing techniques have accordingly been combined with advanced DL techniques and utilized for hydrological applications.

Water body mapping and extraction denotes automatically extracting water bodies from remote sensing images, which is a crucial study in hydro-geology for the sustainable development of regional ecosystems. With the increasing need for large-area monitoring, remote sensing satellites and airborne systems with multi-spectral and hyper-spectral sensors have become popular approaches for water body mapping and extraction due to their favorable properties. Over a long period of time, research has been conducted on exploiting the potential of remote sensing images for studying natural resources, such as water body segmentation. After collecting images with multiple bands, image processing methods are adopted, which can be divided into three categories: threshold-based methods, index-based methods, and learning-based methods. A simple diagram of these three methods is illustrated in Figure 13.

Figure 13. Comparison diagram of algorithm structure for (a) threshold-based, (b) index-based, and (c) learning-based methods.

Threshold segmentation utilizes a threshold in one or more bands to distinguish water bodies from other objects (Guo et al. Citation2020). For band selection, the near infrared (NIR) region has been proven to be the most suitable band for threshold segmentation (Bijeesh and Narasimhamurthy Citation2020). For example, Bolanos conducted a threshold-based procedure to extract water regions from SAR imagery (Bolanos et al. Citation2016). Compared with threshold-based methods, index-based methods develop the extraction technology by calculating band ratios, namely water body indexes, instead of using band pixel values directly. The most widely utilized water body indexes are the normalized difference water index (NDWI), the modified NDWI (MNDWI), and the automated water extraction index (AWEI). Based on these indexes, various studies were conducted and improved the classification performance to some extent. Ali detected urban surface water bodies by using Landsat OLI TIRS and NDWI (Ali et al. Citation2019). Recently, NDWI was integrated with unsupervised deep learning methods by Li for accurate water extraction (J. Li, Meng et al. Citation2022).
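For reference, the NDWI of McFeeters combines the green and NIR bands as NDWI = (Green − NIR)/(Green + NIR), and water pixels are typically those above a small positive threshold. The numpy sketch below assumes two reflectance arrays of the same scene; the threshold value is an assumption that would be tuned per region.

```python
import numpy as np

def ndwi_water_mask(green, nir, threshold=0.1):
    """Index-based water extraction via NDWI (McFeeters formulation).

    green, nir: 2-D reflectance arrays of the same scene.
    Returns a boolean mask where NDWI exceeds the threshold (water tends
    to be positive; vegetation and bare soil tend to be negative).
    """
    green = np.asarray(green, dtype=float)
    nir = np.asarray(nir, dtype=float)
    ndwi = (green - nir) / (green + nir + 1e-12)  # guard divide-by-zero
    return ndwi > threshold

# Example with synthetic reflectance bands standing in for a real scene.
rng = np.random.default_rng(1)
green, nir = rng.random((128, 128)), rng.random((128, 128))
print("water fraction:", ndwi_water_mask(green, nir).mean())
```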

However, difficulties exist in threshold-based and index-based methods, as the optimal thresholds vary across regions, restricting their utilization on large-scale multi-spectral images. Therefore, learning-based algorithms are utilized to solve the water extraction problem with the help of remote sensing images. A previous study demonstrated that, as the number of samples increases, the optimal machine learning models can be ranked as support vector machine, neural network, random forest, decision tree, and XGBoost (A. Li et al. Citation2021). Recently, deep learning methods have also become a hot research topic applied to water extraction and mapping. Popular networks, such as CNN, FCN, U-Net, and variations of CNN-based networks, have been adopted to achieve water body extraction (Basaeed, Bhaskar, and Al-Mualla Citation2016; W. Feng et al. Citation2018; L. Li et al. Citation2019). Yu recommended a structure integrating a CNN and logistic regression classifiers for water body extraction (L. Yu et al. Citation2017). Isikdogan proposed a fully convolutional neural network named DeepWaterMap to segment water in Landsat imagery (Isikdogan, Bovik, and Passalacqua Citation2017). Feng presented an enhanced deep convolutional encoder–decoder network based on Deep U-Net to extract water resources from remote sensing images (W. Feng et al. Citation2018). In 2020, Wang proposed a novel CNN structure based on DenseNet (G. Huang et al. Citation2017) to identify water in the region of Poyang Lake (G. Wang et al. Citation2020). Recently, Xiang adopted a dense pyramid pooling module (DensePPM) to capture global contexts for segmenting water bodies from aerial images; DensePPM achieved a 5.35% improvement in the U-Net-based ablation experiment and reached state-of-the-art performance compared with previous models (Xiang et al. Citation2023).
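As a representative learning-based setup, the sketch below instantiates a U-Net with a binary water/non-water output using the open-source segmentation_models_pytorch package; the encoder choice, four-band input (e.g. RGB + NIR), and loss function are assumptions, not the configuration of any cited model.

```python
import torch
import segmentation_models_pytorch as smp

# U-Net with a ResNet-34 encoder; 4 input bands (e.g. RGB + NIR), 1 class.
model = smp.Unet(encoder_name="resnet34", encoder_weights=None,
                 in_channels=4, classes=1)

x = torch.randn(2, 4, 256, 256)       # a small batch of image tiles
logits = model(x)                     # (2, 1, 256, 256)
water_prob = torch.sigmoid(logits)    # per-pixel water probability

# Training typically pairs these logits with binary water masks, e.g.:
loss_fn = smp.losses.DiceLoss(mode="binary")
```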

4.4. Environmental protection

As environmental problems have been attracting attention around the world, a number of related studies (some examples are shown in Figure 14) have been conducted in recent times. Carbon dioxide (CO2) pollution, one of the critical issues, has caused a wide range of adverse effects on human health and environmental conservation and has increased global warming trends (Brunekreef and Holgate Citation2002; Manisalidis et al. Citation2020; Ramanathan and Feng Citation2009). It could lead to unpredictable damage to the Earth, such as biodiversity reduction, sea level rise, and climate change in the ecological environment. Therefore, the changes in atmospheric CO2 concentration are supposed to be identified and quantified for valid emission monitoring, including its distribution and diffusion mechanisms in spatial and temporal dimensions.

Figure 14. Environmental research using remote sensing images: (a) segmentation of weakly visible environmental microorganism images (Kulwa et al. Citation2023), (b) estimation of Ground-Level NO2 Pollution (Scheibenreif, Mommert, and Borth Citation2022), and (c) Hotspot for atmospheric CO2 concentrations retrieved from Greenhouse Gases Observing Satellite data (Crivelari-Costa et al. Citation2023).

Different from most real-time environmental monitoring, which is conducted on the ground, CO2 concentration data are difficult to retrieve in time. Consequently, satellite hyper-spectral sensing systems are selected as appropriate methods for carbon emission monitoring by providing fast access to carbon-emission-related data. For sensors to detect atmospheric CO2, early research mainly depended on the thermal infrared band, which is sensitive to changes in mid-to-high tropospheric CO2. In contrast, given that the shortwave infrared band is more sensitive to changes near the ground, it can be utilized to achieve CO2 concentration retrieval at the bottom of the atmosphere. Since the greenhouse effect is mainly attributed to CO2, which contributes about 70% to global warming (S. Zhou et al. Citation2023), reducing carbon dioxide has become one of the top priorities for environmental protection.

To better integrate and analyze remote sensing data, deep learning methods such as neural networks with different structures have been adopted for CO2 monitoring. For example, based on hyper-spectral remote sensing images, Zhang applied a new classification network combining spatial and spectral information to discriminate carbon dioxide in sensing images, solving the problem of noisy results that arises when only spectral information is used for classification (L. Zhang, Wang, and An Citation2020). Traditional remote sensing image methods such as ISODATA and SVM were also utilized for performance comparison. Towards the estimation of another common anthropogenic air pollutant, nitrogen dioxide (NO2), at high resolution with global coverage, Scheibenreif introduced deep learning utilizing multi-band imagery for air pollution estimation (Scheibenreif, Mommert, and Borth Citation2022). Considering other environmental protection works, DL-based algorithms have also been adopted for efficient monitoring. As an example, the identification of environmental microorganisms (EMs) and their corresponding physiological characteristics via automatic image processing techniques has been proposed to enhance the segmentation of weakly visible (namely transparent, noisy, and low-contrast) EM images (Kulwa et al. Citation2023).

5. Conclusion

This paper conducts a comprehensive review of remote sensing image segmentation by deep learning models, mainly concerning three aspects: the remote sensing systems integrated with different platforms, the algorithms proposed with architectural innovations and technical advancements, and the wide applications in downstream tasks. In particular, observation platforms and sensors, plus popular RS image datasets, are first presented, with detailed analysis of UAV-based aviation remote sensing and satellite-based space remote sensing systems. After that, the challenges of applying deep learning to RS image segmentation are summarized. Then the classification of DL algorithms is performed according to the evolution order and model characteristics, mainly comprising architecture designs (e.g. CNN, Transformer, and hybrid structures) and advanced techniques beyond supervised learning (e.g. transfer learning, self-supervised learning, and semi-supervised learning). The analysis and comparison are itemized for each category, leading to a general conclusion that DL algorithms can make full use of both the spectral and spatial information of RS images, thus presenting superiority over traditional methods. Besides, by extracting appropriate features for classification or segmentation tasks, DL-based models have proven reliable and achieved outstanding accuracy on a wide range of applications, especially in the four domains mentioned previously (geological, agricultural, hydrological, and environmental applications). With the remarkable development of remote sensing imaging and AI technologies, great improvement of segmentation performance has been achieved in a precise and efficient manner.

However, significant challenges need to be addressed in further research and development:

(1)

As summarized in our review, the limited number of labeled samples is one of the main bottlenecks in RS image analysis. Therefore, advanced techniques such as transfer learning, generative adversarial networks, and semi-supervised learning could contribute to solving the problems caused by limited labeled training samples. In addition, considering the abundant information in RS image data, including the multiple bands of multi-spectral/hyper-spectral images and very high resolution images, research and innovation are expected to better interpret and represent high-level semantics in remote sensing images for downstream classification and segmentation tasks.

(2)

Multi-modal remote sensing image segmentation methods that utilize multiple data sources obtained from different sensors (e.g. optical images, radar data, infrared images) could be further improved for generalization enhancement and precision improvement. Consequently, to make more comprehensive use of this information, well-designed frameworks and novel fusion approaches are expected for the integration of the deep semantics of multiple modalities. Besides, lightweight optimization is also a significant aspect in responding to the computational requirements of practical applications.

(3)

Since increasing interest has been focused on foundation models in the field of visual representation learning, remote sensing visual foundation models are also expected to find various kinds of applications in different sub-fields of RS scenes. Model structures, training paradigms for large-scale unlabeled data, and the ever-growing training resources could evolve to increase the flexibility of RS image segmentation foundation models for different input data and task requirements.

(4)

For all kinds of RS images, inaccurate edges are a main problem to tackle, which mainly involves two aspects. On the one hand, owing to the cropping operation in the pre-processing of large-scale images, artificial edges are inevitably added to the RS image patches used for model training. The adverse impact of these edges should be reduced to enable global interaction between different patches and achieve precise segmentation of the overall image. On the other hand, since expanding the receptive fields of DL models increases contextual information for segmentation but decreases detail features at the same time, this contradiction leads to inaccurate recognition of edge details. Consequently, researchers could pay more attention to edge enhancement methods for detailed edge segmentation.

(5)

Prior knowledge, generally referring to domain expertise, previous research, and logical rules, is extremely helpful for the learning and optimization of neural networks. Therefore, exploring an effective way to introduce prior knowledge into DL models is promising in the RS image segmentation domain.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Additional information

Funding

This work was supported by the Natural Science Foundation of China under Grant 42201386, in part by the Fundamental Research Funds for the Central Universities and the Youth Teacher International Exchange and Growth Program of USTB (QNXM20220033), and Interdisciplinary Research Project for Young Teachers of USTB (Fundamental Research Funds for the Central Universities: FRF-IDRY-22-018).

References

  • Abbas, Amreen, Sweta Jain, Mahesh Gour, and Swetha Vankudothu. 2021. “Tomato Plant Disease Detection Using Transfer Learning with C-GAN Synthetic Images.” Computers and Electronics in Agriculture 187:106279. https://doi.org/10.1016/j.compag.2021.106279.
  • Aleissaee, Abdulaziz Amer, Amandeep Kumar, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal, Gui-Song Xia, and Fahad Shahbaz Khan. 2023. “Transformers in Remote Sensing: A Survey.” Remote Sensing 15 (7): 1860. https://doi.org/10.3390/rs15071860.
  • Ali, Muhammad Ichsan, Gufran Darma Dirawan, Abdul Hafid Hasim, and Muh Rais Abidin. 2019. “Detection of Changes in Surface Water Bodies Urban Area with NDWI and MNDWI Methods.” International Journal on Advanced Science Engineering Information Technology 9 (3): 946–951. https://doi.org/10.18517/ijaseit.9.3.8692.
  • Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla. 2017. “Segnet: A Deep Convolutional Encoder–Decoder Architecture for Image Segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12): 2481–2495. https://doi.org/10.1109/TPAMI.34.
  • Bai, Lubin, Shihong Du, Xiuyuan Zhang, Haoyu Wang, Bo Liu, and Song Ouyang. 2022. “Domain Adaptation for Remote Sensing Image Semantic Segmentation: An Integrated Approach of Contrastive Learning and Adversarial Learning.” IEEE Transactions on Geoscience and Remote Sensing 60:1–13.
  • Bandara, Wele Gedara Chaminda, Nithin Gopalakrishnan Nair, and Vishal M. Patel. 2022. “DDPM-CD: Remote Sensing Change Detection using Denoising Diffusion Probabilistic Models.” arXiv preprint arXiv: 2206.11892.
  • Baranchuk, Dmitry, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. 2021. “Label-Efficient Semantic Segmentation with Diffusion Models.” arXiv preprint arXiv: 2112.03126.
  • Basaeed, Essa, Harish Bhaskar, and Mohammed Al-Mualla. 2016. “Supervised Remote Sensing Image Segmentation Using Boosted Convolutional Neural Networks.” Knowledge-Based Systems 99:19–27. https://doi.org/10.1016/j.knosys.2016.01.028.
  • Bijeesh, T. V., and K. N. Narasimhamurthy. 2020. “Surface Water Detection and Delineation Using Remote Sensing Images: A Review of Methods and Algorithms.” Sustainable Water Resources Management 6 (1): 1–23. https://doi.org/10.1007/s40899-020-00425-4.
  • Bolanos, Sandra, Doug Stiff, Brian Brisco, and Alain Pietroniro. 2016. “Operational Surface Water Detection and Monitoring Using Radarsat 2.” Remote Sensing 8 (4): 285. https://doi.org/10.3390/rs8040285.
  • Brunekreef, Bert, and Stephen T. Holgate. 2002. “Air Pollution and Health.” The Lancet 360 (9341): 1233–1242. https://doi.org/10.1016/S0140-6736(02)11274-8.
  • Caron, Mathilde, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments.” Advances in Neural Information Processing Systems 33:9912–9924.
  • Caron, Mathilde, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. “Emerging Properties in Self-Supervised Vision Transformers.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9650–9660.
  • Castelluccio, Marco, Giovanni Poggi, Carlo Sansone, and Luisa Verdoliva. 2015. “Land Use Classification in Remote Sensing Images by Convolutional Neural Networks.” arXiv preprint arXiv: 1508.00092.
  • Chawla, Ila, L Karthikeyan, and Ashok K. Mishra. 2020. “A Review of Remote Sensing Applications for Water Security: Quantity, Quality, and Extremes.” Journal of Hydrology 585:124826. https://doi.org/10.1016/j.jhydrol.2020.124826.
  • Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. “A Simple Framework for Contrastive Learning of Visual Representations.” In International Conference on Machine Learning, 1597–1607.
  • Chen, Ting, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. 2020. “Big Self-Supervised Models are Strong Semi-Supervised Learners.” Advances in Neural Information Processing Systems 33:22243–22255.
  • Chen, Yushi, Zhouhan Lin, Xing Zhao, Gang Wang, and Yanfeng Gu. 2014. “Deep Learning-Based Classification of Hyperspectral Data.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (6): 2094–2107. https://doi.org/10.1109/JSTARS.4609443.
  • Chen, Yanbei, Massimiliano Mancini, Xiatian Zhu, and Zeynep Akata. 2024. “Semi-Supervised and Unsupervised Deep Visual Learning: A Survey.” IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (3): 1327–1347.
  • Chen, Liang-Chieh, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2014. “Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs.” arXiv preprint arXiv: 1412.7062.
  • Chen, Liang-Chieh, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2017. “Deeplab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs.” IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4): 834–848. https://doi.org/10.1109/TPAMI.2017.2699184.
  • Chen, Liang-Chieh, George Papandreou, Florian Schroff, and Hartwig Adam. 2017. “Rethinking Atrous Convolution for Semantic Image Segmentation.” arXiv preprint arXiv: 1706.05587.
  • Chen, Ning, Jun Yue, Leyuan Fang, and Shaobo Xia. 2023. “SpectralDiff: A Generative Framework for Hyperspectral Image Classification With Diffusion Models.” IEEE Transactions on Geoscience and Remote Sensing 61:1–16.
  • Chen, Liang-Chieh, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. “Encoder–Decoder with Atrous Separable Convolution for Semantic Image Segmentation.” In Proceedings of the European Conference on Computer Vision (ECCV), 801–818.
  • Crivelari-Costa, Patrícia Monique, Mendelson Lima, Newton La Scala Jr, Fernando Saragosa Rossi, João Lucas Della-Silva, Ricardo Dalagnol, Paulo Eduardo Teodoro, et al. 2023. “Changes in Carbon Dioxide Balance Associated with Land Use and Land Cover in Brazilian Legal Amazon Based on Remotely Sensed Imagery.” Remote Sensing 15 (11): 2780.
  • Cui, Binge, Xin Chen, and Yan Lu. 2020. “Semantic Segmentation of Remote Sensing Images Using Transfer Learning and Deep Convolutional Neural Network with Dense Connection.” IEEE Access 8: 116744–116755. https://doi.org/10.1109/Access.6287639.
  • Diakogiannis, Foivos I., François Waldner, Peter Caccetta, and Chen Wu. 2020. “ResUNet-a: A Deep Learning Framework for Semantic Segmentation of Remotely Sensed Data.” ISPRS Journal of Photogrammetry and Remote Sensing 162:94–114. https://doi.org/10.1016/j.isprsjprs.2020.01.013.
  • Digra, Monia, Renu Dhir, and Nonita Sharma. 2022. “Land Use Land Cover Classification of Remote Sensing Images Based on the Deep Learning Approaches: A Statistical Analysis and Review.” Arabian Journal of Geosciences 15 (10): 1003. https://doi.org/10.1007/s12517-022-10246-8.
  • Ding, Lei, Hao Tang, and Lorenzo Bruzzone. 2020. “LANet: Local Attention Embedding to Improve the Semantic Segmentation of Remote Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 59 (1): 426–435. https://doi.org/10.1109/TGRS.36.
  • Doersch, Carl, Abhinav Gupta, and Alexei A. Efros. 2015. “Unsupervised Visual Representation Learning by Context Prediction.” In Proceedings of the IEEE International Conference on Computer Vision, 1422–1430.
  • Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2020. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.” arXiv preprint arXiv: 2010.11929.
  • Drusch, Matthias, Umberto Del Bello, Sébastien Carlier, Olivier Colin, Veronica Fernandez, Ferran Gascon, Bianca Hoersch. 2012. “Sentinel-2: ESA's Optical High-Resolution Mission for GMES Operational Services.” Remote Sensing of Environment 120:25–36. https://doi.org/10.1016/j.rse.2011.11.026.
  • Everaerts, Jurgen. 2008. “The Use of Unmanned Aerial Vehicles (UAVs) for Remote Sensing and Mapping.” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 37 (2008): 1187–1192.
  • Feng, Jie, Jiantong Chen, Qigong Sun, Ronghua Shang, Xianghai Cao, Xiangrong Zhang, and Licheng Jiao. 2020. “Convolutional Neural Network Based on Bandwise-Independent Convolution and Hard Thresholding for Hyperspectral Band Selection.” IEEE Transactions on Cybernetics 51 (9): 4414–4428. https://doi.org/10.1109/TCYB.2020.3000725.
  • Feng, Wenqing, Haigang Sui, Weiming Huang, Chuan Xu, and Kaiqiang An. 2018. “Water Body Extraction From Very High-Resolution Remote Sensing Imagery Using Deep U-Net and a Superpixel-Based Conditional Random Field Model.” IEEE Geoscience and Remote Sensing Letters 16 (4): 618–622. https://doi.org/10.1109/LGRS.2018.2879492.
  • Fu, Jun, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. 2019. “Dual Attention Network for Scene Segmentation.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3146–3154.
  • Gao, Liang, Hui Liu, Minhang Yang, Long Chen, Yaling Wan, Zhengqing Xiao, and Yurong Qian. 2021. “STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14:10990–11003. https://doi.org/10.1109/JSTARS.2021.3119654.
  • Graikos, Alexandros, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. 2022. “Diffusion Models as Plug-and-Play Priors.” arXiv preprint arXiv: 2206.09012.
  • Grill, Jean-Bastien, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch. 2020. “Bootstrap Your Own Latent—A New Approach to Self-Supervised Learning.” Advances in Neural Information Processing Systems 33:21271–21284.
  • Guerrero Tello, José Francisco, Mauro Coltelli, Maria Marsella, Angela Celauro, and José Antonio Palenzuela Baena. 2022. “Convolutional Neural Network Algorithms for Semantic Segmentation of Volcanic Ash Plumes Using Visible Camera Imagery.” Remote Sensing 14 (18): 4477. https://doi.org/10.3390/rs14184477.
  • Guo, Hongxiang, Guojin He, Wei Jiang, Ranyu Yin, Lei Yan, and Wanchun Leng. 2020. “A Multi-Scale Water Extraction Convolutional Neural Network (MWEN) Method for GaoFen-1 Remote Sensing Images.” ISPRS International Journal of Geo-Information 9 (4): 189. https://doi.org/10.3390/ijgi9040189.
  • Hamaguchi, Ryuhei, Aito Fujita, Keisuke Nemoto, Tomoyuki Imaizumi, and Shuhei Hikosaka. 2018. “Effective Use of Dilated Convolutions for Segmenting Small Object Instances in Remote Sensing Imagery.” In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 1442–1450.
  • Hang, Renlong, Ping Yang, Feng Zhou, and Qingshan Liu. 2022. “Multiscale Progressive Segmentation Network for High-Resolution Remote Sensing Imagery.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–12.
  • Hartling, Sean, Vasit Sagan, and Maitiniyazi Maimaitijiang. 2021. “Urban Tree Species Classification Using UAV-based Multi-Sensor Data Fusion and Machine Learning.” GIScience & Remote Sensing 58 (8): 1250–1275. https://doi.org/10.1080/15481603.2021.1974275.
  • He, Kaiming, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. “Masked Autoencoders are Scalable Vision Learners.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16000–16009.
  • He, Kaiming, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. “Momentum Contrast for Unsupervised Visual Representation Learning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738.
  • He, Ke, Weiwei Sun, Gang Yang, Xiangchao Meng, Kai Ren, Jiangtao Peng, and Qian Du. 2022. “A Dual Global–local Attention Network for Hyperspectral Band Selection.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–13.
  • He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
  • He, Ji, Lina Zhao, Hongwei Yang, Mengmeng Zhang, and Wei Li. 2019. “HSI-BERT: Hyperspectral Image Classification Using the Bidirectional Encoder Representation from Transformers.” IEEE Transactions on Geoscience and Remote Sensing 58 (1): 165–178. https://doi.org/10.1109/TGRS.36.
  • Hong, Danfeng, Lianru Gao, Naoto Yokoya, Jing Yao, Jocelyn Chanussot, Qian Du, and Bing Zhang. 2020. “More Diverse Means Better: Multimodal Deep Learning Meets Remote-Sensing Imagery Classification.” IEEE Transactions on Geoscience and Remote Sensing 59 (5): 4340–4354. https://doi.org/10.1109/TGRS.2020.3016820.
  • Hong, Danfeng, Zhu Han, Jing Yao, Lianru Gao, Bing Zhang, Antonio Plaza, and Jocelyn Chanussot. 2021. “SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers.” IEEE Transactions on Geoscience and Remote Sensing 60:1–15. https://doi.org/10.1109/TGRS.2022.3172371.
  • Hong, Danfeng, Jing Yao, Chenyu Li, Deyu Meng, Naoto Yokoya, and Jocelyn Chanussot. 2023. “Decoupled-and-Coupled Networks: Self-Supervised Hyperspectral Image Super-Resolution With Subpixel Fusion.” IEEE Transactions on Geoscience and Remote Sensing 61:1–12. https://doi.org/10.1109/TGRS.2023.3324497.
  • Hong, Danfeng, Bing Zhang, Xuyang Li, Yuxuan Li, Chenyu Li, Jing Yao, Naoto Yokoya, et al. 2023. “Spectralgpt: Spectral Foundation Model.” arXiv preprint arXiv: 2311.07113.
  • Hong, Danfeng, Bing Zhang, Hao Li, Yuxuan Li, Jing Yao, Chenyu Li, Martin Werner, et al. 2023. “Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for Cross-City Semantic Segmentation Using High-Resolution Domain Adaptation Networks.” Remote Sensing of Environment 299:113856. https://doi.org/10.1016/j.rse.2023.113856.
  • Hou, Sikang, Hongye Shi, Xianghai Cao, Xiaohua Zhang, and Licheng Jiao. 2021. “Hyperspectral Imagery Classification Based on Contrastive Learning.” IEEE Transactions on Geoscience and Remote Sensing 60:1–13.
  • Hu, Jie, Li Shen, and Gang Sun. 2018. “Squeeze-and-Excitation Networks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141.
  • Huang, Liwei, Bitao Jiang, Shouye Lv, Yanbo Liu, and Ying Fu. 2023. “Deep Learning-Based Semantic Segmentation of Remote Sensing Images: A Survey.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 1–28.
  • Huang, Gao, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. “Densely Connected Convolutional Networks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708.
  • Huang, Zilong, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. 2019. “Ccnet: Criss-Cross Attention for Semantic Segmentation.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 603–612.
  • Isikdogan, Furkan, Alan C. Bovik, and Paola Passalacqua. 2017. “Surface Water Mapping by Deep Learning.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10 (11): 4909–4918. https://doi.org/10.1109/JSTARS.4609443.
  • Jain, Deepak Kumar, Surendra Bilouhan Dubey, Rishin Kumar Choubey, Amit Sinhal, Siddharth Kumar Arjaria, Amar Jain, and Haoxiang Wang. 2018. “An Approach for Hyperspectral Image Classification by Optimizing SVM Using Self Organizing Map.” Journal of Computational Science 25:252–259. https://doi.org/10.1016/j.jocs.2017.07.016.
  • Jang, GyuJin, Jaeyoung Kim, Ju-Kyung Yu, Hak-Jin Kim, Yoonha Kim, Dong-Wook Kim, Kyung-Hwan Kim, Chang Woo Lee, and Yong Suk Chung. 2020. “Cost-Effective Unmanned Aerial Vehicle (UAV) Platform for Field Plant Breeding Application.” Remote Sensing 12 (6): 998. https://doi.org/10.3390/rs12060998.
  • Jiang, Wandong, Jiangbo Xi, Zhenhong Li, Minghui Zang, Bo Chen, Chenglong Zhang, Zhenjiang Liu, Siyan Gao, and Wu Zhu. 2022. “Deep Learning for Landslide Detection and Segmentation in High-Resolution Optical Images Along the Sichuan-Tibet Transportation Corridor.” Remote Sensing 14 (21): 5490. https://doi.org/10.3390/rs14215490.
  • Jiao, Rushi, Yichi Zhang, Le Ding, Rong Cai, and Jicong Zhang. 2022. “Learning with Limited Annotations: A Survey on Deep Semi-Supervised Learning for Medical Image Segmentation.” arXiv preprint arXiv: 2207.14191.
  • Kampffmeyer, Michael, Arnt-Borre Salberg, and Robert Jenssen. 2016. “Semantic Segmentation of Small Objects and Modeling of Uncertainty in Urban Remote Sensing Images Using Deep Convolutional Neural Networks.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 680–688.
  • Keshk, Hatem Magdy, and Xu-Cheng Yin. 2017. “Satellite Super-Resolution Images Depending on Deep Learning Methods: A Comparative Study.” In 2017 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), 1–7.
  • Kim, Miae, Junghee Lee, Daehyeon Han, Minso Shin, Jungho Im, Junghye Lee, Lindi J. Quackenbush, and Zhu Gu. 2018. “Convolutional Neural Network-Based Land Cover Classification Using 2-D Spectral Reflectance Curve Graphs with Multitemporal Satellite Imagery.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (12): 4604–4617. https://doi.org/10.1109/JSTARS.4609443.
  • Kipf, Thomas N., and Max Welling. 2016. “Semi-Supervised Classification with Graph Convolutional Networks.” arXiv preprint arXiv: 1609.02907.
  • Kirillov, Alexander, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, et al. 2023. “Segment Anything.” arXiv preprint arXiv: 2304.02643.
  • Kolbeinsson, Benedikt, and Krystian Mikolajczyk. 2022. “Multi-Class Segmentation from Aerial Views using Recursive Noise Diffusion.” arXiv preprint arXiv: 2212.00787.
  • Kulwa, Frank, Chen Li, Marcin Grzegorzek, Md Mamunur Rahaman, Kimiaki Shirahama, and Sergey Kosov. 2023. “Segmentation of Weakly Visible Environmental Microorganism Images Using Pair-Wise Deep Learning Features.” Biomedical Signal Processing and Control 79:104168. https://doi.org/10.1016/j.bspc.2022.104168.
  • Kupidura, Przemysław. 2019. “The Comparison of Different Methods of Texture Analysis for Their Efficacy for Land Use Classification in Satellite Imagery.” Remote Sensing 11 (10): 1233. https://doi.org/10.3390/rs11101233.
  • Kussul, Nataliia, Mykola Lavreniuk, Sergii Skakun, and Andrii Shelestov. 2017. “Deep Learning Classification of Land Cover and Crop Types Using Remote Sensing Data.” IEEE Geoscience and Remote Sensing Letters 14 (5): 778–782. https://doi.org/10.1109/LGRS.2017.2681128.
  • LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 86 (11): 2278–2324. https://doi.org/10.1109/5.726791.
  • Li, Wenyuan, Hao Chen, and Zhenwei Shi. 2021. “Semantic Segmentation of Remote Sensing Images with Self-Supervised Multitask Representation Learning.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14:6438–6450. https://doi.org/10.1109/JSTARS.2021.3090418.
  • Li, Aimin, Meng Fan, Guangduo Qin, Youcheng Xu, and Hailong Wang. 2021. “Comparative Analysis of Machine Learning Algorithms in Automatic Identification and Extraction of Water Boundaries.” Applied Sciences 11 (21): 10062. https://doi.org/10.3390/app112110062.
  • Li, Haifeng, Yi Li, Guo Zhang, Ruoyun Liu, Haozhe Huang, Qing Zhu, and Chao Tao. 2022. “Global and Local Contrastive Self-Supervised Learning for Semantic Segmentation of HR Remote Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 60:1–14.
  • Li, Junjie, Yizhuo Meng, Yuanxi Li, Qian Cui, Xining Yang, Chongxin Tao, Zhe Wang, Linyi Li, and Wen Zhang. 2022. “Accurate Water Extraction Using Remote Sensing Imagery Based on Normalized Difference Water Index and Unsupervised Deep Learning.” Journal of Hydrology 612: 128202. https://doi.org/10.1016/j.jhydrol.2022.128202.
  • Li, Liwei, Zhi Yan, Qian Shen, Gang Cheng, Lianru Gao, and Bing Zhang. 2019. “Water Body Extraction From Very High Spatial Resolution Remote Sensing Data Based on Fully Convolutional Networks.” Remote Sensing 11 (10): 1162. https://doi.org/10.3390/rs11101162.
  • Li, Jiangyun, Sen Zha, Chen Chen, Meng Ding, Tianxiang Zhang, and Hong Yu. 2022. “Attention Guided Global Enhancement and Local Refinement Network for Semantic Segmentation.” IEEE Transactions on Image Processing 31:3211–3223. https://doi.org/10.1109/TIP.2022.3166673.
  • Li, Ying, Haokui Zhang, Xizhe Xue, Yenan Jiang, and Qiang Shen. 2018. “Deep Learning for Remote Sensing Image Classification: A Survey.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (6): e1264.
  • Li, Linhui, Wenjun Zhang, Xiaoyan Zhang, Mahmoud Emam, and Weipeng Jing. 2023. “Semi-Supervised Remote Sensing Image Semantic Segmentation Method Based on Deep Learning.” Electronics 12 (2): 348. https://doi.org/10.3390/electronics12020348.
  • Li, Rui, Shunyi Zheng, Ce Zhang, Chenxi Duan, Jianlin Su, Libo Wang, and Peter M. Atkinson. 2021. “Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 60:1–13.
  • Lin, Tsung-Yi, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. “Feature Pyramid Networks for Object Detection.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2117–2125.
  • Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022.
  • Liu, Jun, and Xuewei Wang. 2021. “Plant Diseases and Pests Detection Based on Deep Learning: A Review.” Plant Methods 17 (1): 1–18. https://doi.org/10.1186/s13007-020-00700-7.
  • Liu, Bing, Anzhu Yu, Kuiliang Gao, Xiong Tan, Yifan Sun, and Xuchu Yu. 2022. “DSS-TRM: Deep Spatial–spectral Transformer for Hyperspectral Image Classification.” European Journal of Remote Sensing 55 (1): 103–114. https://doi.org/10.1080/22797254.2021.2023910.
  • Liu, Wenjie, Yongjun Zhang, Haisheng Fan, Yongjie Zou, and Zhongwei Cui. 2020. “A New Multi-Channel Deep Convolutional Neural Network for Semantic Segmentation of Remote Sensing Image.” IEEE Access 8: 131814–131825. https://doi.org/10.1109/Access.6287639.
  • Liu, Yifan, Qigang Zhu, Feng Cao, Junke Chen, and Gang Lu. 2021. “High-Resolution Remote Sensing Image Segmentation Framework Based on Attention Mechanism and Adaptive Weighting.” ISPRS International Journal of Geo-Information 10 (4): 241. https://doi.org/10.3390/ijgi10040241.
  • Long, Jonathan, Evan Shelhamer, and Trevor Darrell. 2015. “Fully Convolutional Networks for Semantic Segmentation.” In Proceedings of the IEEE conference on computer vision and pattern recognition, 3431–3440.
  • Lv, Jinna, Qi Shen, Mingzheng Lv, Yiran Li, Lei Shi, and Peiying Zhang. 2023. “Deep Learning-Based Semantic Segmentation of Remote Sensing Images: A Review.” Frontiers in Ecology and Evolution 11: 1201125. https://doi.org/10.3389/fevo.2023.1201125.
  • Ma, Lei, Manchun Li, Xiaoxue Ma, Liang Cheng, Peijun Du, and Yongxue Liu. 2017. “A Review of Supervised Object-Based Land-Cover Image Classification.” ISPRS Journal of Photogrammetry and Remote Sensing 130:277–293. https://doi.org/10.1016/j.isprsjprs.2017.06.001.
  • Ma, Zhengjing, and Gang Mei. 2021. “Deep Learning for Geological Hazards Analysis: Data, Models, Applications, and Opportunities.” Earth-Science Reviews 223: 103858. https://doi.org/10.1016/j.earscirev.2021.103858.
  • Manisalidis, Ioannis, Elisavet Stavropoulou, Agathangelos Stavropoulos, and Eugenia Bezirtzoglou. 2020. “Environmental and Health Impacts of Air Pollution: A Review.” Frontiers in Public Health 8:14. https://doi.org/10.3389/fpubh.2020.00014.
  • Martins, José Augusto Correa, Keiller Nogueira, Lucas Prado Osco, Felipe David Georges Gomes, Danielle Elis Garcia Furuya, Wesley Nunes Gonçalves, Diego André Sant'Ana. 2021. “Semantic Segmentation of Tree-Canopy in Urban Environment with Pixel-Wise Deep Learning.” Remote Sensing 13 (16): 3054. https://doi.org/10.3390/rs13163054.
  • Masek, Jeffrey G., Eric F. Vermote, Nazmi E. Saleous, Robert Wolfe, Forrest G. Hall, Karl Fred Huemmrich, Feng Gao, Jonathan Kutler, and Teng-Kui Lim. 2006. “A Landsat Surface Reflectance Dataset for North America, 1990–2000.” IEEE Geoscience and Remote Sensing Letters 3 (1): 68–72. https://doi.org/10.1109/LGRS.2005.857030.
  • Mou, Lichao, Pedram Ghamisi, and Xiao Xiang Zhu. 2017. “Unsupervised Spectral–Spatial Feature Learning Via Deep Residual Conv–Deconv Network for Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 56 (1): 391–406. https://doi.org/10.1109/TGRS.2017.2748160.
  • Naseer, Muhammad Muzammal, Kanchana Ranasinghe, Salman H. Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. 2021. “Intriguing Properties of Vision Transformers.” Advances in Neural Information Processing Systems 34:23296–23308.
  • Neupane, Bipul, Teerayut Horanont, and Jagannath Aryal. 2021. “Deep Learning-Based Semantic Segmentation of Urban Features in Satellite Images: A Review and Meta-Analysis.” Remote Sensing 13 (4): 808. https://doi.org/10.3390/rs13040808.
  • Nogueira, Keiller, Mauro Dalla Mura, Jocelyn Chanussot, William Robson Schwartz, and Jefersson Alex Dos Santos. 2019. “Dynamic Multicontext Segmentation of Remote Sensing Images Based on Convolutional Networks.” IEEE Transactions on Geoscience and Remote Sensing 57 (10): 7503–7520.
  • Noor, Norzailawati Mohd, Alias Abdullah, and Mazlan Hashim. 2018. “Remote Sensing UAV/Drones and Its Applications for Urban Areas: A Review.” In IOP Conference Series: Earth and Environmental Science, 012003.
  • Noroozi, Mehdi, and Paolo Favaro. 2016. “Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles.” In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VI, 69–84.
  • Osco, Lucas Prado, Mauro dos Santos De Arruda, José Marcato Junior, Neemias Buceli Da Silva, Ana Paula Marques Ramos, Érika Akemi Saito Moriya, Nilton Nobuhiro Imai, et al. 2020. “A Convolutional Neural Network Approach for Counting and Geolocating Citrus-Trees in UAV Multispectral Imagery.” ISPRS Journal of Photogrammetry and Remote Sensing 160:97–106. https://doi.org/10.1016/j.isprsjprs.2019.12.010.
  • Pan, Qian, Maofang Gao, Pingbo Wu, Jingwen Yan, and Shilei Li. 2021. “A Deep-Learning-Based Approach for Wheat Yellow Rust Disease Recognition From Unmanned Aerial Vehicle Images.” Sensors 21 (19): 6540. https://doi.org/10.3390/s21196540.
  • Pan, Yifan, Xianfeng Zhang, Guido Cervone, and Liping Yang. 2018. “Detection of Asphalt Pavement Potholes and Cracks Based on the Unmanned Aerial Vehicle Multispectral Imagery.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (10): 3701–3712.
  • Panboonyuen, Teerapong, Kulsawasd Jitkajornwanich, Siam Lawawirojwong, Panu Srestasathiern, and Peerapon Vateekul. 2019. “Semantic Segmentation on Remotely Sensed Images Using An Enhanced Global Convolutional Network with Channel Attention and Domain Specific Transfer Learning.” Remote Sensing 11 (1): 83. https://doi.org/10.3390/rs11010083.
  • Park, Namuk, and Songkuk Kim. 2022. “How Do Vision Transformers Work?” arXiv preprint arXiv: 2202.06709.
  • Ramanathan, Veerabhadran, and Yan Feng. 2009. “Air Pollution, Greenhouse Gases and Climate Change: Global and Regional Perspectives.” Atmospheric Environment 43 (1): 37–50. https://doi.org/10.1016/j.atmosenv.2008.09.063.
  • Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. 2015. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, 234–241.
  • Scheibenreif, Linus, Michael Mommert, and Damian Borth. 2022. “Toward Global Estimation of Ground-Level NO2 Pollution with Deep Learning and Remote Sensing.” IEEE Transactions on Geoscience and Remote Sensing 60:1–14. https://doi.org/10.1109/TGRS.2022.3160827.
  • Shao, Zhenfeng, Yin Pan, Chunyuan Diao, and Jiajun Cai. 2019. “Cloud Detection in Remote Sensing Images Based on Multiscale Features-Convolutional Neural Network.” IEEE Transactions on Geoscience and Remote Sensing 57 (6): 4062–4076.
  • Shi, Di, and Xiaojun Yang. 2016. “An Assessment of Algorithmic Parameters Affecting Image Classification Accuracy by Random Forests.” Photogrammetric Engineering & Remote Sensing 82 (6): 407–417. https://doi.org/10.14358/PERS.82.6.407.
  • Su, Jinya, Dewei Yi, Baofeng Su, Zhiwen Mi, Cunjia Liu, Xiaoping Hu, Xiangming Xu, Lei Guo, and Wen-Hua Chen. 2021. “Aerial Visual Perception in Smart Farming: Field Study of Wheat Yellow Rust Monitoring.” IEEE Transactions on Industrial Informatics 17 (3): 2242–2249.
  • Su, Jinya, Xiaoyong Zhu, Shihua Li, and Wen-Hua Chen. 2023. “AI Meets UAVs: A Survey on AI Empowered UAV Perception Systems for Precision Agriculture.” Neurocomputing 518:242–270. https://doi.org/10.1016/j.neucom.2022.11.020.
  • Sun, Xian, Peijin Wang, Wanxuan Lu, Zicong Zhu, Xiaonan Lu, Qibin He, Junxi Li, et al. 2023. “RingMo: A Remote Sensing Foundation Model With Masked Image Modeling.” IEEE Transactions on Geoscience and Remote Sensing 61:1–22.
  • Sun, Le, Guangrui Zhao, Yuhui Zheng, and Zebin Wu. 2022. “Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 60:1–14.
  • Talukdar, Swapan, Pankaj Singha, Susanta Mahato, Swades Pal, Yuei-An Liou, and Atiqur Rahman. 2020. “Land-Use Land-Cover Classification by Machine Learning Classifiers for Satellite Observations—A Review.” Remote Sensing 12 (7): 1135. https://doi.org/10.3390/rs12071135.
  • Tarvainen, Antti, and Harri Valpola. 2017. “Mean Teachers Are Better Role Models: Weight-Averaged Consistency Targets Improve Semi-Supervised Deep Learning Results.” Advances in Neural Information Processing Systems 30:1195–1204.
  • Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30:5998–6008.
  • Vincenzi, Stefano, Angelo Porrello, Pietro Buzzega, Marco Cipriano, Pietro Fronte, Roberto Cuccu, Carla Ippoliti, Annamaria Conte, and Simone Calderara. 2021. “The Color Out of Space: Learning Self-Supervised Representations for Earth Observation Imagery.” In 2020 25th International Conference on Pattern Recognition (ICPR), 3034–3041.
  • Viskovic, Lucija, Ivana Nizetic Kosovic, and Toni Mastelic. 2019. “Crop Classification Using Multi-Spectral and Multitemporal Satellite Imagery with Machine Learning.” In 2019 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), 1–5.
  • Wang, Yi, Conrad M. Albrecht, Nassim Ait Ali Braham, Lichao Mou, and Xiao Xiang Zhu. 2022. “Self-Supervised Learning in Remote Sensing: A Review.” arXiv preprint arXiv: 2206.13188.
  • Wang, Jia-Xin, Si-Bao Chen, Chris H. Q. Ding, Jin Tang, and Bin Luo. 2022. “Semi-Supervised Semantic Segmentation of Remote Sensing Images with Iterative Contrastive Network.” IEEE Geoscience and Remote Sensing Letters 19:1–5.
  • Wang, Hong, Xianzhong Chen, Tianxiang Zhang, Zhiyong Xu, and Jiangyun Li. 2022. “CCTNet: Coupled CNN and Transformer Network for Crop Segmentation of Remote Sensing Images.” Remote Sensing 14 (9): 1956. https://doi.org/10.3390/rs14091956.
  • Wang, Jiaxin, Chris H. Q. Ding, Sibao Chen, Chenggang He, and Bin Luo. 2020. “Semi-Supervised Remote Sensing Image Semantic Segmentation Via Consistency Regularization and Average Update of Pseudo-Label.” Remote Sensing 12 (21): 3603. https://doi.org/10.3390/rs12213603.
  • Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. “Non-Local Neural Networks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803.
  • Wang, Libo, Rui Li, Chenxi Duan, Ce Zhang, Xiaoliang Meng, and Shenghui Fang. 2022. “A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images.” IEEE Geoscience and Remote Sensing Letters 19:1–5.
  • Wang, Libo, Rui Li, Ce Zhang, Shenghui Fang, Chenxi Duan, Xiaoliang Meng, and Peter M. Atkinson. 2022. “UNetFormer: A UNet-Like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery.” ISPRS Journal of Photogrammetry and Remote Sensing 190:196–214. https://doi.org/10.1016/j.isprsjprs.2022.06.008.
  • Wang, Wenxuan, Leiming Liu, Tianxiang Zhang, Jiachen Shen, Jing Wang, and Jiangyun Li. 2022. “Hyper-ES2T: Efficient Spatial–Spectral Transformer for the Classification of Hyperspectral Remote Sensing Images.” International Journal of Applied Earth Observation and Geoinformation 113:103005. https://doi.org/10.1016/j.jag.2022.103005.
  • Wang, Lijun, Jiayao Wang, Xiwang Zhang, Laigang Wang, and Fen Qin. 2022. “Deep Segmentation and Classification of Complex Crops Using Multi-Feature Satellite Imagery.” Computers and Electronics in Agriculture 200:107249. https://doi.org/10.1016/j.compag.2022.107249.
  • Wang, Guojie, Mengjuan Wu, Xikun Wei, and Huihui Song. 2020. “Water Identification From High-Resolution Remote Sensing Images Based on Multidimensional Densely Connected Convolutional Neural Networks.” Remote Sensing 12 (5): 795. https://doi.org/10.3390/rs12050795.
  • Wang, Wenhai, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2021. “Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 568–578.
  • Wójtowicz, Marek, Andrzej Wójtowicz, and Jan Piekarczyk. 2016. “Application of Remote Sensing Methods in Agriculture.” Communications in Biometry and Crop Science 11 (1): 31–50.
  • Wolleb, Julia, Robin Sandkühler, Florentin Bieder, Philippe Valmaggia, and Philippe C. Cattin. 2022. “Diffusion Models for Implicit Image Segmentation Ensembles.” In International Conference on Medical Imaging with Deep Learning, 1336–1348.
  • Wu, Junde, Huihui Fang, Yu Zhang, Yehui Yang, and Yanwu Xu. 2022. “MedSegDiff: Medical Image Segmentation with Diffusion Probabilistic Model.” arXiv preprint arXiv: 2211.00611.
  • Wu, Junde, Rao Fu, Huihui Fang, Yu Zhang, and Yanwu Xu. 2023. “MedSegDiff-V2: Diffusion based Medical Image Segmentation with Transformer.” arXiv preprint arXiv: 2301.11798.
  • Wurm, Michael, Thomas Stark, Xiao Xiang Zhu, Matthias Weigand, and Hannes Taubenböck. 2019. “Semantic Segmentation of Slums in Satellite Images Using Transfer Learning on Fully Convolutional Neural Networks.” ISPRS Journal of Photogrammetry and Remote Sensing 150:59–69. https://doi.org/10.1016/j.isprsjprs.2019.02.006.
  • Xiang, Deqiang, Xin Zhang, Wei Wu, and Hongbin Liu. 2023. “DensePPMUNet-a: A Robust Deep Learning Network for Segmenting Water Bodies From Aerial Images.” IEEE Transactions on Geoscience and Remote Sensing 61:1–11.
  • Xiao, Yi, Yahui Guo, Guodong Yin, Xuan Zhang, Yu Shi, Fanghua Hao, and Yongshuo Fu. 2022. “UAV Multispectral Image-Based Urban River Water Quality Monitoring Using Stacked Ensemble Machine Learning Algorithms—A Case Study of the Zhanghe River, China.” Remote Sensing 14 (14): 3272. https://doi.org/10.3390/rs14143272.
  • Xiao, Tao, Yikun Liu, Yuwen Huang, Mingsong Li, and Gongping Yang. 2023. “Enhancing Multiscale Representations with Transformer for Remote Sensing Image Semantic Segmentation.” IEEE Transactions on Geoscience and Remote Sensing 61:1–16.
  • Xiong, Chen, Qiangsheng Li, and Xinzheng Lu. 2020. “Automated Regional Seismic Damage Assessment of Buildings Using An Unmanned Aerial Vehicle and a Convolutional Neural Network.” Automation in Construction 109:102994. https://doi.org/10.1016/j.autcon.2019.102994.
  • Xu, Yizhe, and Jie Jiang. 2022. “High-Resolution Boundary-Constrained and Context-Enhanced Network for Remote Sensing Image Segmentation.” Remote Sensing 14 (8): 1859. https://doi.org/10.3390/rs14081859.
  • Xu, Zhiyong, Weicun Zhang, Tianxiang Zhang, and Jiangyun Li. 2020. “HRCNet: High-Resolution Context Extraction Network for Semantic Segmentation of Remote Sensing Images.” Remote Sensing 13 (1): 71. https://doi.org/10.3390/rs13010071.
  • Xu, Zhiyong, Weicun Zhang, Tianxiang Zhang, Zhifang Yang, and Jiangyun Li. 2021. “Efficient Transformer for Remote Sensing Image Segmentation.” Remote Sensing 13 (18): 3585. https://doi.org/10.3390/rs13183585.
  • Yang, Xiaofei, Weijia Cao, Yao Lu, and Yicong Zhou. 2022. “Hyperspectral Image Transformer Classification Networks.” IEEE Transactions on Geoscience and Remote Sensing 60:1–15.
  • Yao, Jing, Bing Zhang, Chenyu Li, Danfeng Hong, and Jocelyn Chanussot. 2023. “Extended Vision Transformer (ExViT) for Land Use and Land Cover Classification: A Multimodal Deep Learning Framework.” IEEE Transactions on Geoscience and Remote Sensing 61:1–15.
  • Yao, Min, Yaozu Zhang, Guofeng Liu, and Dongdong Pang. 2024. “SSNet: A Novel Transformer and CNN Hybrid Network for Remote Sensing Semantic Segmentation.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 17:3023–3037. https://doi.org/10.1109/JSTARS.2024.3349657.
  • Yasir, Muhammad, Wan Jianhua, Liu Shanwei, Hui Sheng, Xu Mingming, and Md Hossain. 2023. “Coupling of Deep Learning and Remote Sensing: A Comprehensive Systematic Literature Review.” International Journal of Remote Sensing 44 (1): 157–193. https://doi.org/10.1080/01431161.2022.2161856.
  • Ye, Huanran, Sheng Liu, Kun Jin, and Haohao Cheng. 2021. “CT-UNet: An Improved Neural Network Based on U-Net for Building Segmentation in Remote Sensing Images.” In 2020 25th International Conference on Pattern Recognition (ICPR), 166–172.
  • Yu, Fisher, and Vladlen Koltun. 2015. “Multi-Scale Context Aggregation by Dilated Convolutions.” arXiv preprint arXiv: 1511.07122.
  • Yu, Long, Zhiyin Wang, Shengwei Tian, Feiyue Ye, Jianli Ding, and Jun Kong. 2017. “Convolutional Neural Networks for Water Body Extraction From Landsat Imagery.” International Journal of Computational Intelligence and Applications 16 (1): 1750001. https://doi.org/10.1142/S1469026817500018.
  • Yuan, Yuhui, Xilin Chen, and Jingdong Wang. 2020. “Object-Contextual Representations for Semantic Segmentation.” In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, 173–190.
  • Yuan, Xiaohui, Jianfang Shi, and Lichuan Gu. 2021. “A Review of Deep Learning Methods for Semantic Segmentation of Remote Sensing Imagery.” Expert Systems with Applications 169:114417. https://doi.org/10.1016/j.eswa.2020.114417.
  • Yuan, Jiangye, DeLiang Wang, and Rongxing Li. 2014. “Remote Sensing Image Segmentation by Combining Spectral and Texture Features.” IEEE Transactions on Geoscience and Remote Sensing 52 (1): 16–24. https://doi.org/10.1109/TGRS.2012.2234755.
  • Yuan, Kunhao, Xu Zhuang, Gerald Schaefer, Jianxin Feng, Lin Guan, and Hui Fang. 2021. “Deep-Learning-Based Multispectral Satellite Image Segmentation for Water Body Detection.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14:7422–7434. https://doi.org/10.1109/JSTARS.2021.3098678.
  • Zang, Ning, Yun Cao, Yuebin Wang, Bo Huang, Liqiang Zhang, and P. Takis Mathiopoulos. 2021. “Land-Use Mapping for High-Spatial Resolution Remote Sensing Image Via Deep Learning: A Review.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14:5372–5391. https://doi.org/10.1109/JSTARS.2021.3078631.
  • Zhang, Tianxiang, Yuanxiu Cai, Peixian Zhuang, and Jiangyun Li. 2024. “Remotely Sensed Crop Disease Monitoring by Machine Learning Algorithms: A Review.” Unmanned Systems 12 (1): 161–171.
  • Zhang, Hang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. 2018. “Context Encoding for Semantic Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7151–7160.
  • Zhang, Fan, Bo Du, Liangpei Zhang, and Lefei Zhang. 2016. “Hierarchical Feature Learning with Dropout K-Means for Hyperspectral Image Classification.” Neurocomputing 187:75–82. https://doi.org/10.1016/j.neucom.2015.07.132.
  • Zhang, Xin, Liangxiu Han, Yingying Dong, Yue Shi, Wenjiang Huang, Lianghao Han, Pablo González-Moreno, et al. 2019. “A Deep Learning-Based Approach for Automated Yellow Rust Disease Detection From High-Resolution Hyperspectral UAV Images.” Remote Sensing 11 (13): 1554. https://doi.org/10.3390/rs11131554.
  • Zhang, Cheng, Wanshou Jiang, Yuan Zhang, Wei Wang, Qing Zhao, and Chenjie Wang. 2022. “Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery.” IEEE Transactions on Geoscience and Remote Sensing 60:1–20.
  • Zhang, Deyuan, Zhenghong Liu, and Xiangbin Shi. 2020. “Transfer Learning on EfficientNet for Remote Sensing Image Classification.” In 2020 5th International Conference on Mechanical, Control and Computer Engineering (ICMCCE), 2255–2258.
  • Zhang, Zhengxin, Qingjie Liu, and Yunhong Wang. 2018. “Road Extraction by Deep Residual U-Net.” IEEE Geoscience and Remote Sensing Letters 15 (5): 749–753. https://doi.org/10.1109/LGRS.2018.2802944.
  • Zhang, Le, Jinsong Wang, and Zhiyong An. 2020. “Classification Method of CO2 Hyperspectral Remote Sensing Data Based on Neural Network.” Computer Communications 156:124–130. https://doi.org/10.1016/j.comcom.2020.03.045.
  • Zhang, Tianxiang, Wenxuan Wang, Jing Wang, Yuanxiu Cai, Zhifang Yang, and Jiangyun Li. 2022. “Hyper-LGNet: Coupling Local and Global Features for Hyperspectral Image Classification.” Remote Sensing 14 (20): 5251. https://doi.org/10.3390/rs14205251.
  • Zhang, Tianxiang, Zhiyong Xu, Jinya Su, Zhifang Yang, Cunjia Liu, Wen-Hua Chen, and Jiangyun Li. 2021. “Ir-UNet: Irregular Segmentation U-Shape Network for Wheat Yellow Rust Detection by UAV Multispectral Imagery.” Remote Sensing 13 (19): 3892. https://doi.org/10.3390/rs13193892.
  • Zhang, Ning, Guijun Yang, Yuchun Pan, Xiaodong Yang, Liping Chen, and Chunjiang Zhao. 2020. “A Review of Advanced Technologies and Development for Hyperspectral-Based Plant Disease Detection in the Past Three Decades.” Remote Sensing 12 (19): 3188. https://doi.org/10.3390/rs12193188.
  • Zhang, Tianxiang, Zhifang Yang, Zhiyong Xu, and Jiangyun Li. 2022. “Wheat Yellow Rust Severity Detection by Efficient DF-UNet and UAV Multispectral Imagery.” IEEE Sensors Journal 22 (9): 9057–9068. https://doi.org/10.1109/JSEN.2022.3156097.
  • Zhang, Liangpei, Lefei Zhang, Dacheng Tao, and Xin Huang. 2012. “Tensor Discriminative Locality Alignment for Hyperspectral Image Spectral–Spatial Feature Extraction.” IEEE Transactions on Geoscience and Remote Sensing 51 (1): 242–256. https://doi.org/10.1109/TGRS.2012.2197860.
  • Zhao, Zhicheng, Ze Luo, Jian Li, Can Chen, and Yingchao Piao. 2020. “When Self-Supervised Learning Meets Scene Classification: Remote Sensing Scene Classification Based on a Multitask Learning Framework.” Remote Sensing 12 (20): 3276. https://doi.org/10.3390/rs12203276.
  • Zhao, Hengshuang, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2017. “Pyramid Scene Parsing Network.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2881–2890.
  • Zhong, Zilong, Ying Li, Lingfei Ma, Jonathan Li, and Wei-Shi Zheng. 2021. “Spectral–Spatial Transformer Network for Hyperspectral Image Classification: A Factorized Architecture Search Framework.” IEEE Transactions on Geoscience and Remote Sensing 60:1–15. https://doi.org/10.1109/TGRS.2022.3225267.
  • Zhou, Yongxiu, Honghui Wang, Ronghao Yang, Guangle Yao, Qiang Xu, and Xiaojuan Zhang. 2022. “A Novel Weakly Supervised Remote Sensing Landslide Semantic Segmentation Method: Combining CAM and CycleGAN Algorithms.” Remote Sensing 14 (15): 3650. https://doi.org/10.3390/rs14153650.
  • Zhou, Shaoqing, Xiaoman Zhang, Shiwei Chu, Tiantian Zhang, and Junfei Wang. 2023. “Research on Remote Sensing Image Carbon Emission Monitoring Based on Deep Learning.” Signal Processing 207:108943. https://doi.org/10.1016/j.sigpro.2023.108943.
  • Zou, Xueyan, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, and Yong Jae Lee. 2023. “Segment Everything Everywhere All At Once.” arXiv preprint arXiv: 2304.06718.
  • Zou, Yuliang, Zizhao Zhang, Han Zhang, Chun-Liang Li, Xiao Bian, Jia-Bin Huang, and Tomas Pfister. 2020. “PseudoSeg: Designing Pseudo Labels for Semantic Segmentation.” arXiv preprint arXiv: 2010.09713.