
A spectral and spatial transformer for hyperspectral remote sensing image super-resolution

Article: 2313102 | Received 06 Sep 2023, Accepted 27 Jan 2024, Published online: 07 Feb 2024

ABSTRACT

Hyperspectral images (HSIs) generally have low spatial resolution, and early multispectral images lack corresponding panchromatic bands, so fusion methods cannot be used to enhance their resolution. Many researchers have proposed image super-resolution methods to address these limitations. However, these methods still suffer from issues such as inadequate feature representation, a lack of spectral feature representation, and high computational cost and inefficiency. To address these challenges, a spectral and spatial transformer (SST) algorithm for hyperspectral remote sensing image super-resolution is introduced. The algorithm uses a spatial transformer structure to extract the spatial features between image pixels and a spectral transformer structure to extract the spectral features within image pixels, and integrates the two for HSI super-resolution. Comparative experiments against currently advanced methods on three publicly available hyperspectral datasets consistently show that our algorithm performs better in both spectral fidelity and spatial restoration. Furthermore, the proposed algorithm was applied to real-world super-resolution experiments in the region of China's Ruoergai National Park, and pixel-based classification was then conducted on the super-resolution images; the results indicate that our algorithm can also support future remote sensing interpretation tasks.

1. Introduction

Hyperspectral remote sensing images (HSIs) are digital images captured by sensors on airborne or satellite platforms that measure the reflection of electromagnetic waves from the Earth's surface in multiple narrow and continuous bands, resulting in data with high spectral resolution. The data obtained from HSIs provide a wealth of information on the chemical and physical properties of ground objects, enabling accurate classification and identification of various ground objects. HSIs have found wide applications in fields such as mineral exploration (Bedini Citation2017; Govil et al. Citation2021), agriculture (Nguyen et al. Citation2021; Zhang et al. Citation2016), urban planning (Navin and Agilandeeswari Citation2020; Weber et al. Citation2018), and environmental monitoring (Niu et al. Citation2021; Stuart, McGonigle, and Willmott Citation2019). In the field of data processing, there is a continuous influx of studies focused on HSIs, including classification (Chen et al. Citation2020; Zhang et al. Citation2023; Zhu, Deng, et al. Citation2021), recognition (Gao et al. Citation2021), super-resolution (Gong et al. Citation2022), and unmixing (Zhu, Wang, et al. Citation2021). Nonetheless, limited spatial resolution remains a notable obstacle to the broad utilization of HSIs. This constraint is frequently linked to factors such as sensor limitations, imaging altitude, and cost (Fu, Liang, and You Citation2021). Additionally, the disparity in spatial resolution between early imagery (multispectral images) and later HSIs introduces uncertainty into long-term temporal studies and practical applications. At the same time, differences in remote sensing image resolution significantly constrain the application of many new, high-performing algorithms to subsequent remote sensing interpretation or classification tasks (Wang et al. Citation2023a; Wang et al. Citation2023b). Hence, maintaining or improving spatial resolution while retaining spectral resolution is a significant challenge.

Image super-resolution (SR) reconstruction is an effective approach to address the aforementioned issues. Its objective is to produce a high-resolution image from a low-resolution input by leveraging the information in the input image, prior knowledge, or additional data from external sources to recover lost high-frequency details. Image super-resolution can be classified into two main categories, depending on the number of input images: multi-image super-resolution (Dian et al. Citation2021; Li et al. Citation2018) and single-image super-resolution (Jiang et al. Citation2020). Multi-image super-resolution typically fuses high-resolution panchromatic or multispectral images with low-resolution HSIs with the aim of enhancing spatial resolution. In recent years, research efforts dedicated to devising innovative algorithms for this goal have increased notably. Examples include progressive multiscale deformable residual networks for multi-image super-resolution processing (Liu, Feng, et al. Citation2022) and end-to-end deep neural networks for the multi-image super-resolution task (Arefin et al. Citation2020). These studies are instrumental in advancing multi-image super-resolution technology. Nonetheless, multi-image super-resolution methods have a significant limitation: they require high-resolution auxiliary images during processing. When such auxiliary images are difficult to obtain, the efficacy of multi-image super-resolution becomes restricted. This situation is exemplified in applications involving long-term temporal sequences, such as Landsat 5 TM imagery, which lacks a panchromatic band for image fusion. Moreover, multi-image super-resolution is computationally intensive and time-consuming (Chen et al. Citation2022; Li et al. Citation2020).

Single-image super-resolution (SISR) is a widely used technique for generating high-resolution images by interpolating and restoring low-resolution images. Unlike multi-image super-resolution methods, SISR does not require auxiliary images, making it more versatile. With the advancement of image super-resolution technology, numerous researchers have applied these methods or their variations to HSI super-resolution research, achieving impressive results. Examples include HSI super-resolution based on prior knowledge models (Gong et al. Citation2022), HSI super-resolution for faces using convolutional backbones and spectral splitting networks (Jiang et al. Citation2022; Jiang et al. Citation2022), and HSI super-resolution algorithms employing alternating 2D/3D convolutional neural networks (Li, Wang, and Li Citation2021). Convolutional neural network (CNN)-based models (LeCun et al. Citation1998) have produced significant outcomes in image super-resolution tasks; however, they have certain limitations. For instance, applying a uniform convolution kernel to different image regions is suboptimal for image super-resolution. Moreover, convolution kernels usually extract local features, which may not satisfy the need to handle a wide range of features for super-resolution. Although methods based on generative adversarial networks (GANs) (Goodfellow et al. Citation2014) perform well in the field of super-resolution, they are unstable during training and require substantial computing resources and time; thus, they are less suitable for small teams or individuals (Chai et al. Citation2022).

The transformer (Vaswani et al. Citation2017) is a neural network model that relies on the self-attention mechanism and was initially developed for natural language processing, including machine translation (Vaswani et al. Citation2017) and text generation (Dong et al. Citation2017; Gong, Crego, and Senellart Citation2019; Zhu et al. Citation2019). Unlike traditional recurrent neural networks (RNNs) (Hochreiter and Schmidhuber Citation1997) or convolutional neural networks (CNNs), the transformer does not require sequential processing through recurrent or convolutional structures. Instead, it uses the self-attention mechanism to weigh and aggregate elements in a sequence, enabling sequence-to-sequence modeling. Recently, the transformer has demonstrated remarkable performance in image classification (Dosovitskiy et al. Citation2021; Touvron et al. Citation2021; Wang et al. Citation2022), object detection (Carion et al. Citation2020; Jiang et al. Citation2022; Jiang et al. Citation2022), image reconstruction (Chen, Zheng, and Lu Citation2021), and image segmentation (Chen et al. Citation2021). Furthermore, the use of the transformer for image super-resolution was explored by Liu et al. (Citation2018), and other algorithms, such as the efficient super-resolution transformer (ESRT) (Lu et al. Citation2022) and SwinIR, have shown impressive results. However, it is widely believed, including by our team, that conventional deep learning super-resolution methods are not suitable for HSIs (He et al. Citation2022; Lei, Shi, and Mo Citation2021; Liu, Hu, et al. Citation2022; Tu et al. Citation2022). This is because HSIs can have hundreds of bands, making the interactions between the channels crucial and irreplaceable. Additionally, the relationships between multiple spectral channels can enhance super-resolution performance, since they capture more information and allow more features to be extracted. Conversely, conventional methods focus more on extracting features between pixels, since regular images have only three color bands (red, green, and blue). To address these challenges, some researchers have used CNNs to learn spatial-spectral prior knowledge for HSI super-resolution (Jiang et al. Citation2020). Others, such as Liu, Hu, et al. (Citation2022), have adopted a hybrid approach, intertwining CNNs and transformers to concurrently extract the spatial and spectral attributes from remote sensing images. Although these methods consider the spectral feature extraction of remote sensing images, they are still entirely based on CNNs or heavily rely on them; additionally, their super-resolution performance has room for improvement.

In summary, due to the generally low spatial resolution of HSIs and the inability to employ fusion methods for resolution enhancement in early multispectral images lacking corresponding panchromatic bands, numerous researchers have introduced various approaches to single-image super-resolution. However, these methods still face challenges such as inadequate feature representation, a lack of spectral feature representation, and issues concerning model efficiency and cost-effectiveness. To address these challenges, we introduce an innovative solution: the spectral and spatial transformer (SST) super-resolution algorithm for HSIs. Specifically, the embedding of the low-resolution image passes through several integrated transformer blocks to perform high-level feature extraction. The integrated transformer blocks are connected by residuals to further enhance the learned features. The high-level features consist of spatial and spectral features, which are extracted by the spatial transformer and spectral transformer of the integrated transformer block, respectively. Additionally, we introduce a 3D transformer module that extracts the spectral and spatial features simultaneously, and we conducted comparative experiments to show the advantages and disadvantages of both designs. In the comparative experiments on three publicly available HSI datasets, the results clearly indicate that our proposed method outperforms the other methods across all evaluation metrics, robustly demonstrating its ability to extract both the spectral and spatial features inherent in the images. Furthermore, the proposed algorithm was applied to real-world super-resolution experiments within the region of China's Ruoergai National Park, and pixel-based classification was subsequently conducted on the experimental results; the results show that our proposed algorithm is well suited for future remote sensing interpretation tasks. In summary, the main contributions of this article are as follows.

  1. We propose a super-resolution method for HSIs that utilizes spatial-transformer and spectral-transformer structures to extract the spatial and spectral features, respectively; these are then integrated to achieve super-resolution.

  2. To overcome the issue of large data volume, we incorporated the alternating sliding window method in the spatial-transformer module.

  3. In addition, we designed a 3D transformer structure that could simultaneously extract both the spatial and spectral features, and experiments were conducted to analyze its performance.

  4. An experiment on super-resolution was carried out within the Ruoergai National Park region, and the results were utilized for wetland landscape classification, affirming our proposed algorithm's viability for future research endeavors.

The remaining sections of the paper are organized as follows: Section 2 discusses related work. Section 3 presents a detailed description of our proposed SST algorithm. Section 4 contains the experimental setup and results. Section 5 delves into the discussion of some meaningful questions. Section 6 provides conclusions and future directions.

2. Related work

2.1. SR of HSIs based on traditional methods

Image super-resolution reconstruction is a traditional image processing problem. As far back as the 1990s, when hyperspectral remote sensing was emerging, researchers began exploring methods to improve the spatial resolution of HSIs. Interpolation was the earliest method used to address this issue; it increases image resolution by filling in missing pixels. Common interpolation methods include nearest-neighbor interpolation, bilinear interpolation, and bicubic interpolation. However, these traditional interpolation methods suffer from issues such as image blurring and loss of detail. To meet higher demands, researchers subsequently introduced various nonuniform image interpolation methods (Han, Park, and Joo Citation2015; Jingmeng et al. Citation2015; Tao et al. Citation2003) to prevent post-interpolation boundary blurring and detail loss. They also used methods based on the dual-tree complex wavelet transform (Solanki, Israni, and Shah Citation2018) to obtain high-resolution images without checkerboard artifacts. Methods based on probabilistic models (Irmak et al. Citation2016; Tao, Feng, and Bao Citation2018), iterative back-projection methods (Li et al. Citation2014), and their improvements (Deepa and Islam Citation2020) have enhanced image reconstruction quality from various angles. Additionally, many learning-based methods have been proposed (Xinlei and Naifeng Citation2016; Zhang et al. Citation2015; Zhang, Du, and Lu Citation2017). These methods either leverage the internal similarities in the image or learn the mapping between low-resolution and high-resolution images. While they focus on learning and dictionary optimization, other steps are rarely optimized or considered within a unified framework. In summary, although traditional methods for hyperspectral remote sensing image super-resolution have achieved some success, they often face limitations such as information loss and noise amplification. Consequently, with the development of deep learning techniques, deep learning methods have made significant breakthroughs and can effectively enhance the spatial resolution of HSIs. However, traditional methods remain useful in specific scenarios, particularly when dealing with limited data or computational resources.

2.2. SR of HSIs based on deep learning methods

With the advancement of computing capabilities, deep learning has found increasingly widespread applications. As deep learning methods have evolved, they have gradually been applied to HSI super-resolution. This includes methods based on back-propagation (BP) neural networks (Ding and Bian Citation2008; Wen and Yuan-fei Citation2011), methods based on CNNs (Dong et al. Citation2015; Lei, Shi, and Zou Citation2017), methods based on GANs (Shi et al. Citation2022; Zhang et al. Citation2021), and methods based on transformers (Ma et al. Citation2023; Liu et al. Citation2022). These methods have each achieved state-of-the-art results at different times, but they also have their respective shortcomings. Specifically, BP neural network methods are straightforward to implement but require a large amount of data to converge. They also lack feature extraction capabilities; thus, hyperspectral remote sensing image super-resolution based on BP neural networks is challenging for large-scale applications. Early CNN-based methods struggled to learn deep-level features due to their relatively shallow structures. However, with the introduction of residual structures and increasingly deep networks, they became mainstream in HSI super-resolution tasks due to their excellent feature extraction capabilities. GAN-based methods are capable of generating high-quality results, can automatically learn features, and exhibit good transferability. However, they still face challenges such as high training costs and the need for large volumes of training data. Transformer-based methods, owing to their self-attention mechanisms, excel at capturing global features. They also scale well since they do not downsample the features. However, they also face challenges related to insufficient consideration of spectral features, high computational demands, and limited interpretability. In summary, deep learning-based methods have significantly improved the effectiveness of hyperspectral remote sensing image super-resolution and continue to evolve.

2.3. SR of HSIs with integrating spectral and spatial features

Hyperspectral remote sensing super-resolution tasks require not only higher spatial resolution but also higher spectral fidelity, necessitating the joint processing of spectral and spatial features. Many researchers have attempted to address this issue, using the spectral-spatial network (SSN, Jia et al. Citation2018), spectral-spatial residual network (SSRN, Chen, Zheng, and Lu Citation2021), Interactformer (Liu et al. Citation2022), and dense spectral transformer (DsTer, He et al. Citation2022). Since these methods are tailored for hyperspectral image super-resolution tasks, they have achieved better results compared to earlier methods. However, most current methods that consider the integration of spectral and spatial features are based on CNNs. The downsampling mechanisms in CNNs can still limit the effectiveness of super-resolution. Transformer-based methods effectively address the issue of excessive downsampling. Nevertheless, the exploration of hyperspectral image super-resolution using this approach is in its early stages, and the generalization capabilities of models vary, resulting in diverse outcomes. There is significant room for improvement in this research area.

3. Methodology

In this section, we describe the proposed network in detail, which includes the overall network structure, the integrated transformer block, the spectral transformer block that operates in the spectral domain, the spatial transformer block that operates in the spatial domain, and the 3D transformer block for extracting hybrid features.

3.1. Network architecture

Figure 1(a) illustrates the overall architecture of the SST network; a low-resolution HSI is used as the input, and a high-resolution HSI with the same spectral bands is output. The network accomplishes this by changing the size of the image while preserving the spectral information. The process can be summarized as follows. First, shallow features are extracted from the low-resolution image via linear embedding in the spectral domain. Then, the deep features, including both the spectral and spatial information, are extracted using multiple residual groups and merging layers. The resulting feature image is upsampled to a preset size (×2, ×4, or ×8) and merged with the processed low-resolution image through a global skip connection. Finally, the high-resolution image is obtained by applying the inverse embedding layer to the merged image.
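To make this data flow concrete, the following PyTorch sketch mirrors the pipeline just described. It is a minimal illustration rather than the authors' implementation; the residual-group stand-in, channel sizes, and bicubic interpolation settings are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSTSketch(nn.Module):
    """High-level sketch of the SST data flow (not the authors' released code)."""
    def __init__(self, bands, embed_dim, num_groups, scale, make_group):
        super().__init__()
        self.scale = scale
        self.embed = nn.Conv2d(bands, embed_dim, 3, padding=1)      # linear embedding (Section 3.1.1)
        self.groups = nn.ModuleList(make_group(embed_dim) for _ in range(num_groups))
        self.merge = nn.Conv2d(embed_dim, embed_dim, 3, padding=1)  # stand-in for the merging layer (Section 3.1.3)
        ups = []
        for _ in range(int(math.log2(scale))):                      # sub-pixel upsampling (Section 3.1.4)
            ups += [nn.Conv2d(embed_dim, embed_dim * 4, 3, padding=1), nn.PixelShuffle(2)]
        self.upsample = nn.Sequential(*ups)
        self.skip_conv = nn.Conv2d(bands, embed_dim, 3, padding=1)  # global skip branch (Section 3.1.5)
        self.unembed = nn.Conv2d(embed_dim, bands, 3, padding=1)    # inverse embedding (Section 3.1.6)

    def forward(self, lr):
        f0 = self.embed(lr)                      # shallow features
        f = f0
        for g in self.groups:                    # deep spectral/spatial features (Section 3.1.2)
            f = g(f)
        fhf = self.upsample(f0 + self.merge(f))  # merge, add shallow features, upsample
        lr_up = F.interpolate(lr, scale_factor=self.scale, mode='bicubic', align_corners=False)
        return self.unembed(fhf + self.skip_conv(lr_up))

# Any (B, C, H, W) -> (B, C, H, W) module can stand in for a residual group here.
net = SSTSketch(bands=102, embed_dim=64, num_groups=6, scale=4,
                make_group=lambda d: nn.Sequential(nn.Conv2d(d, d, 3, padding=1), nn.LeakyReLU(0.2)))
hr = net(torch.rand(1, 102, 48, 48))             # (1, 102, 192, 192)
```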

Figure 1. SST network architecture. (a) Overall network structure of the SST model; (b) integrated transformer block.


3.1.1. Linear embedding

The input image $I_{LR} \in \mathbb{R}^{H \times W \times C_{in}}$ is fed into the linear embedding layer $H_E(\cdot)$ to obtain the embedded image $F_0 \in \mathbb{R}^{H \times W \times C}$. Here, $H$ and $W$ represent the height and width of the input image, respectively, while $C_{in}$ and $C$ represent the number of bands of the original input image and the number of channels of the embedded feature image, respectively. The embedding process can be expressed mathematically as follows:
(1) $F_0 = H_E(I_{LR})$
The embedding process in SST is accomplished using a 3 × 3 convolution with a padding value of 1, which preserves the input image size. This operation maps the original input image's band information to a higher-dimensional feature space and extracts shallow features, for which convolutions are well suited.
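A minimal sketch of Eq. (1), assuming a (batch, bands, H, W) tensor layout; the band count and embedding multiple σ below are illustrative values, not settings taken from the paper.

```python
import torch
import torch.nn as nn

C_in, sigma = 102, 9                  # band count and embedding multiple (values assumed for illustration)
C = C_in * sigma
embed = nn.Conv2d(C_in, C, kernel_size=3, padding=1)   # H_E(.): 3x3 convolution, padding 1 keeps H x W

lr = torch.rand(1, C_in, 48, 48)      # a 48 x 48 low-resolution training patch
f0 = embed(lr)                        # F_0 with shape (1, C, 48, 48)
```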

3.1.2. Residual group

The residual network (He et al. Citation2016) was introduced by Kaiming He and colleagues at Microsoft Research in 2016. This architecture has made a significant contribution to addressing the degradation problem of deep networks, and SST draws inspiration from it. Specifically, SST employs $K$ residual groups to extract deep image features. Each residual group consists of $M$ integrated transformer blocks connected in series, followed by a merging layer, with a residual connection added afterward. The overall process can be described as follows:
(2) $F_i = H_{RG}^{i}(F_{i-1}), \quad i = 1, 2, \ldots, K$
where $H_{RG}^{i}(\cdot)$ denotes the $i$th residual group, and $F_i$ represents the intermediate features extracted from it, with the same height, width, and number of channels. Note that $F_0$ is the feature image output by the linear embedding layer, and $F_K$ is the feature image output after passing through the $K$ residual groups.
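A sketch of one residual group under the notation above; the integrated transformer blocks and the merging layer are represented by simple stand-ins here (their own sketches appear in the corresponding subsections), so only the serial-plus-residual structure is illustrated.

```python
import torch
import torch.nn as nn

class ResidualGroup(nn.Module):
    """H_RG: M integrated transformer blocks in series, a merging layer, and a residual connection (sketch)."""
    def __init__(self, dim, num_blocks=6, make_block=None):
        super().__init__()
        # Stand-in blocks; in SST these are the integrated transformer blocks of Section 3.2.
        make_block = make_block or (lambda d: nn.Sequential(nn.Conv2d(d, d, 3, padding=1), nn.LeakyReLU(0.2)))
        self.blocks = nn.Sequential(*[make_block(dim) for _ in range(num_blocks)])
        self.merge = nn.Conv2d(dim, dim, 3, padding=1)   # stand-in for the merging layer (Section 3.1.3)

    def forward(self, x):
        return x + self.merge(self.blocks(x))            # Eq. (2) with the residual connection

f1 = ResidualGroup(dim=64)(torch.rand(1, 64, 48, 48))    # same height, width, and channel count
```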

3.1.3. Merging layer

The merging layer in SST, whether within the residual groups or after them, serves to integrate the extracted features and extract deeper features, thus bolstering the model's capacity to capture the data patterns effectively. The merging layer in SST consists of three convolutions alternately connected with two activation functions, as depicted in Figure 2. To prevent downsampling of the feature image, a padding value of 1 is applied to each convolutional layer. The merging process outside the residual groups can be expressed mathematically as follows:
(3) $F_M = H_M(F_K)$
where $H_M(\cdot)$ represents the merging layer and $F_M$ is the feature image obtained after merging.
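A sketch of the merging layer as described; LeakyReLU is used here because Section 4.2 names it as the network's activation function, though the exact activation placement shown in Figure 2 is assumed.

```python
import torch
import torch.nn as nn

class MergingLayer(nn.Module):
    """H_M(.): three 3x3 convolutions alternated with two activations; padding 1 preserves H x W (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(dim, dim, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(dim, dim, 3, padding=1))

    def forward(self, fk):
        return self.body(fk)                              # F_M = H_M(F_K), Eq. (3)

fm = MergingLayer(64)(torch.rand(1, 64, 48, 48))          # spatial size unchanged
```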

Figure 2. Structure of the merging layer.


3.1.4. Upsampling layer

To produce high-resolution images, SST combines the shallow feature $F_0$ obtained by the linear embedding layer with the deep feature $F_M$ obtained by the residual groups and merging layer. The combined feature image is then upsampled as follows:
(4) $F_{HF} = H_U(F_0 + F_M)$
where $H_U(\cdot)$ denotes the upsampling function, and $F_{HF}$ represents the feature image after upsampling. Shallow features tend to include the lower-frequency aspects inherent in remote sensing images, while deep features excel at capturing the higher-frequency intricacies. These two types of features are combined via skip connections. In the context of SST, a subpixel convolution layer is used to perform the upscaling of the feature image.
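The paper states that a subpixel convolution layer performs the upscaling; a common way to realize the ×2, ×4, and ×8 factors is to stack ×2 PixelShuffle stages, which is the assumption made in this sketch.

```python
import math
import torch
import torch.nn as nn

def make_upsampler(dim, scale):
    """Sub-pixel (PixelShuffle) upsampler H_U(.); x4 and x8 are realized by stacking x2 stages (assumed)."""
    stages = []
    for _ in range(int(math.log2(scale))):
        stages += [nn.Conv2d(dim, dim * 4, 3, padding=1), nn.PixelShuffle(2)]
    return nn.Sequential(*stages)

f_hf = make_upsampler(dim=64, scale=4)(torch.rand(1, 64, 48, 48))   # F_HF: (1, 64, 192, 192)
```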

3.1.5. Global skip connection

To prevent model degradation and achieve higher accuracy and faster convergence, SST incorporates a global skip connection. In this procedure, the low-resolution input image is first upsampled using straightforward bicubic interpolation. A feature image is then extracted from the result using a convolutional layer with padding set to 1. The feature image extracted in this manner is added to $F_{HF}$ to obtain the combined feature image. Mathematically, this can be expressed as follows:
(5) $F_{SK} = F_{HF} + H_{cov}(\uparrow I_{LR})$
where $H_{cov}$ is the convolutional layer and $\uparrow$ represents the bicubic interpolation upsampling process.
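A minimal sketch of Eq. (5); the band and channel counts are placeholders chosen only to make the example runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def global_skip(f_hf, lr, skip_conv, scale):
    """F_SK = F_HF + H_cov(bicubic-upsampled I_LR), Eq. (5) (sketch)."""
    lr_up = F.interpolate(lr, scale_factor=scale, mode='bicubic', align_corners=False)
    return f_hf + skip_conv(lr_up)

skip_conv = nn.Conv2d(102, 64, 3, padding=1)        # band/channel counts assumed for illustration
f_sk = global_skip(torch.rand(1, 64, 192, 192), torch.rand(1, 102, 48, 48), skip_conv, scale=4)
```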

3.1.6. Inverse embedding layer

The linear embedding layer transforms the channel count of the input image from $C_{in}$ to $C$, where typically $C > C_{in}$. The role of the inverse embedding layer is to restore the channel count of the image to match the band count of the original remote sensing image, thereby completing the super-resolution of the remote sensing image. The process can be expressed as follows:
(6) $I_{HR} = H_{UE}(F_{SK})$
where $H_{UE}(\cdot)$ represents the inverse embedding function and $I_{HR}$ represents the output high-resolution remote sensing image.

3.1.7. Loss function

The loss function gauges the quality of the reconstructed remote sensing image and serves as the objective of the optimization process. In our experiments, we employ two distinct loss functions to evaluate the disparity between the reconstructed high-resolution image and the ground truth image, because both high spatial resolution and spectral fidelity are important in remote sensing image super-resolution.

When measuring the accuracy of remote sensing images in the spatial domain, we select the widely used L1 loss (mean absolute error). Compared with the L2 loss (mean square error), the L1 loss does not excessively penalize large errors and has better convergence. The L1 loss can be expressed as follows:
(7) $L_1 = \frac{1}{HWC}\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{c=1}^{C}\left|I_{GT}(i,j,c) - I_{SR}(i,j,c)\right|$
where $I_{GT}$ represents the ground truth image and $I_{SR}$ represents the reconstructed high-resolution image.

When measuring the spectral accuracy of remote sensing images, we select the spectral gradient loss function, which is widely used to measure the fidelity of spectral information. The function $L_G$ can be expressed as follows:
(8) $L_G = \frac{1}{HWC}\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{c=1}^{C}\left|I_{GT}^{G}(i,j,c) - I_{SR}^{G}(i,j,c)\right|$
where $I_{GT}^{G}$ and $I_{SR}^{G}$ represent the spectral gradients of the ground truth image and the reconstructed high-resolution image, respectively. In SST, the final loss function is:
(9) $L_{total} = \lambda_1 L_1 + \lambda_2 L_G$
where $\lambda_1$ and $\lambda_2$ are hyperparameters that balance the two loss functions.
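A sketch of the combined loss under the definitions above, with the spectral gradient taken as a first-order difference along the band dimension; the λ values below are placeholders, since the paper's settings are not given in this excerpt.

```python
import torch

def spectral_gradient(x):
    """First-order difference along the band dimension of a (B, C, H, W) tensor."""
    return x[:, 1:] - x[:, :-1]

def sst_loss(sr, gt, lam1=1.0, lam2=1.0):
    """L_total = lambda1 * L1 + lambda2 * L_G, Eqs. (7)-(9); the lambda values here are placeholders."""
    l1 = torch.mean(torch.abs(gt - sr))                                           # Eq. (7)
    lg = torch.mean(torch.abs(spectral_gradient(gt) - spectral_gradient(sr)))     # Eq. (8)
    return lam1 * l1 + lam2 * lg                                                  # Eq. (9)

loss = sst_loss(torch.rand(1, 102, 192, 192), torch.rand(1, 102, 192, 192))
```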

3.2. Integrated transformer block

The proposed integrated transformer block improves upon the traditional transformer block in two ways (Figure 1(b)). First, it replaces the conventional multihead self-attention (MSA) mechanism with a series connection of spectral multihead self-attention (Spectral-MSA) and spatial multihead self-attention (Spatial-MSA) mechanisms. Second, it can instead use three-dimensional multihead attention (3D-MSA) to combine the computation of the spectral and spatial features. The advantages and disadvantages of these two designs are discussed in the ablation analysis section. An integrated transformer block consists of either a 3D-MSA module or two MSA modules that operate in the spectral and spatial domains, respectively, together with a fully connected module (multilayer perceptron, MLP). The fully connected module consists of two MLP layers with a GELU activation function between them. LayerNorm is applied before the MSA and MLP modules, with residual connections between the individual modules. The process can be expressed as follows:
(10) $Z' = \mathrm{MSA}_{spatial}(\mathrm{MSA}_{spectral}(\mathrm{LN}(X))) + X$
(11) $Z = \mathrm{MLP}(\mathrm{LN}(Z')) + Z'$
or:
(12) $Z' = \mathrm{MSA}_{3D}(\mathrm{LN}(X)) + X$
(13) $Z = \mathrm{MLP}(\mathrm{LN}(Z')) + Z'$
The proposed algorithm utilizes multiple integrated transformer blocks concatenated together. For the first integrated transformer block, $X \in \mathbb{R}^{H \times W \times C}$ denotes the embedded feature image; for later blocks, $X$ corresponds to the self-attention feature image output by the previous integrated transformer block. $\mathrm{MSA}_{spatial}$ and $\mathrm{MSA}_{spectral}$ denote the spatial multihead self-attention and spectral multihead self-attention layers, respectively.
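The block's residual structure can be sketched as follows. The attention module is injected as a placeholder (sketches of the spectral, spatial, and 3D attention appear in Sections 3.2.1-3.2.3), and the (B, H·W, C) token layout and MLP expansion ratio are assumptions rather than reported settings.

```python
import torch
import torch.nn as nn

class IntegratedTransformerBlock(nn.Module):
    """Eqs. (10)-(13): LayerNorm -> attention (spectral then spatial, or 3D) -> residual, then LayerNorm -> MLP -> residual."""
    def __init__(self, dim, attn, mlp_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attn                          # MSA_spatial(MSA_spectral(.)) or MSA_3D(.), injected as a stand-in
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                         # x: (B, H*W, C) token layout assumed
        z = x + self.attn(self.norm1(x))          # Eqs. (10)/(12)
        return z + self.mlp(self.norm2(z))        # Eqs. (11)/(13)

blk = IntegratedTransformerBlock(dim=96, attn=nn.Identity())   # nn.Identity() is only a placeholder attention
y = blk(torch.rand(2, 48 * 48, 96))
```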

Both the 3D-MSA module and the serial spectral- and spatial-MSA modules are of great significance for remote sensing image super-resolution tasks. In computer vision tasks, a conventional image usually consists of three channels corresponding to the red, green, and blue color values. This differs significantly from multispectral images or HSIs, in which the number of bands (equivalent to channels) frequently exceeds three, and each band holds distinct information vital for remote sensing. Due to this disparity, we purposefully devised the integrated transformer block to address the demands of remote sensing image super-resolution tasks within the SST framework. Serving as the foundational component of the residual groups, this unit plays a pivotal role in extracting comprehensive image features consisting of both spectral and spatial attributes. These attributes are harnessed through the spectral transformer block, spatial transformer block, or 3D transformer block.

3.2.1. Spectral transformer block

In recent research on computer vision tasks utilizing transformers, such as the Vision Transformer (ViT) (Dosovitskiy et al. Citation2021) and the Swin-Transformer (Liu et al. Citation2021), self-attention computation methods focus on pixels rather than channels. The primary rationale for this distinction is that traditional images commonly consist of merely three channels. Nonetheless, within the domain of remote sensing, particularly hyperspectral images (HSIs), the number of channels can reach into the hundreds, rendering interchannel self-attention a substantial factor to consider. In SST, the spectral transformer block is specifically designed to compute the self-attention between the bands of the remote sensing images.

Figure 3 illustrates the operation of the spectral transformer block. We use the standard multihead attention mechanism of the transformer. When the feature images pass through the multihead attention layer, the query, key, and value matrices $Q, K, V \in \mathbb{R}^{C_{in} \times d}$ are calculated for each input pixel embedding $X_{ij} \in \mathbb{R}^{C_{in} \times C/C_{in}}$ in each head:
(14) $Q = X_{ij}W_Q, \quad K = X_{ij}W_K, \quad V = X_{ij}W_V$
where $W_Q$, $W_K$, and $W_V$ represent the projection matrices of the query, key, and value, respectively, and are shared among the different pixels; $i$ and $j$ represent the position indices of the pixel. The attention matrix of the current head is calculated through the self-attention mechanism as follows:
(15) $\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(QK^{T}/\sqrt{d} + B\right)V$
where $d$ is the dimension of the query/key/value, and $B$ is the relative position encoding, a set of learnable parameters. Finally, the results from $N$ heads are obtained by executing the attention function $N$ times in parallel; they are concatenated and multiplied by the weight matrix $W_O$ to obtain the self-attention feature $Z_{ij} \in \mathbb{R}^{C_{in} \times C/C_{in}}$ of the pixel.
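A minimal sketch of the per-pixel band attention using PyTorch's built-in multi-head attention. It assumes the embedded channels are ordered band-major so that each pixel's embedding can be viewed as C_in tokens of dimension d = C/C_in; the learnable relative position bias B of Eq. (15) and the exact head configuration are not reproduced.

```python
import torch
import torch.nn as nn

class SpectralMSA(nn.Module):
    """MSA across the C_in band tokens of each pixel (sketch; the bias term B of Eq. (15) is omitted)."""
    def __init__(self, embed_dim, bands, num_heads=3):
        super().__init__()
        assert embed_dim % bands == 0
        self.bands, self.d = bands, embed_dim // bands
        self.attn = nn.MultiheadAttention(self.d, num_heads, batch_first=True)   # shared W_Q, W_K, W_V across pixels

    def forward(self, x):                                  # x: (B, H*W, C)
        b, n, c = x.shape
        tokens = x.reshape(b * n, self.bands, self.d)      # X_ij: C_in tokens of dimension d = C / C_in per pixel
        out, _ = self.attn(tokens, tokens, tokens)         # Eqs. (14)-(15), without the bias term B
        return out.reshape(b, n, c)

msa = SpectralMSA(embed_dim=102 * 9, bands=102, num_heads=3)   # band count and embedding multiple assumed
z = msa(torch.rand(2, 16 * 16, 102 * 9))
```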

Figure 3. Spectral transformer block workflow.


3.2.2. Spatial transformer block

ViT pioneered the use of the transformer architecture in image processing tasks, which has the advantage of retaining more spatial information than traditional CNNs (Raghu et al. Citation2021). The Swin-Transformer further improves the extraction of local self-attention features, addressing the limitations of single-scale feature maps and low resolution in ViT while also reducing the computational complexity. The super-resolution process in remote sensing images maps each pixel to 4 pixels (for ×2 super-resolution tasks) or even more. Compared with image classification tasks, super-resolution tasks are more sensitive to local features than to global features. In SST, the spatial transformer block draws on the Swin-Transformer approach to divide remote sensing images into local windows, restricting the self-attention calculation to these windows and supplementing the global features through an alternating sliding (shifted) window mechanism. In addition, to prevent excessive downsampling of the feature maps, we designed a merging layer to replace patch merging in the Swin-Transformer, as described in Section 3.1. As shown in Figure 4, a feature image of size $H \times W \times C$ is input, and the spatial transformer segments it into $\frac{HW}{M^2}$ nonoverlapping windows, each with a shape of $M \times M \times C$. The self-attention values between the $M^2$ pixels are then calculated separately for each window using the standard self-attention calculation method, similar to the approach used in Section 3.2.1.

Figure 4. Window segmentation method in the spatial transformer block.


In the spatial transformer block, self-attention calculations are confined to a single window; thus, attention between windows cannot be computed. To address this, a combination of local windows and sliding (shifted) windows is used to facilitate cross-window connections. Since super-resolution tasks prioritize the extraction of local features more than image classification tasks do, our algorithm reduces the frequency of shifted-window usage compared with the Swin-Transformer. To elaborate on the specific implementation, prior to the window segmentation step in each residual group, the feature image is shifted downward and to the right by $\lfloor M/2 \rfloor$ pixels.
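A sketch of the window partitioning and the alternating (shifted) windows just described, following the Swin-Transformer convention of a cyclic shift by ⌊M/2⌋; attention masking of the wrapped-around regions is omitted for brevity, and the tensor sizes are illustrative.

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into HW/M^2 non-overlapping M x M windows: (num_windows, M*M, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

def shifted(x, M):
    """Cyclically shift the map down and right by M // 2 pixels before partitioning (alternating windows)."""
    return torch.roll(x, shifts=(M // 2, M // 2), dims=(1, 2))

x = torch.rand(1, 48, 48, 96)                        # (B, H, W, C); sizes assumed for illustration
regular = window_partition(x, M=8)                   # (36, 64, 96): 36 windows of 8 x 8 pixel tokens
alternating = window_partition(shifted(x, M=8), M=8)
```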

3.2.3. 3D transformer block

The spectral transformer block extracts the self-attention features between the different bands within a single pixel, while the spatial transformer block extracts the self-attention features between different pixels in local windows. In contrast, the 3D transformer block computes the spectral and spatial features in parallel, bypassing the sequential structure used by the preceding two blocks. Specifically, the 3D transformer block extracts the self-attention features between all bands in a local window (Figure 5(a)).

Figure 5. Method for calculating the self-attention within a single window. (a) Extraction of the 3D transformer block self-attention features of all bands in a local window. (b) Spatial and spectral-MSA mechanism and 3D-MSA mechanism.


The 3D transformer block uses the same window segmentation method as the spatial transformer block, resulting in local windows with a shape of $M \times M \times C$ after segmentation. The key difference is that the embedded channel dimension $C$ is split into $C_{in}$ groups, where $C_{in}$ is the band count of the original HSI. Thus, each local window contains $M \times M \times C_{in}$ computing units, and the self-attention values are calculated between these units using the standard MSA method, similar to the approach used in Section 3.2.1.
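A sketch of how the 3D transformer block's computing units can be laid out, assuming the embedded channels are grouped back into C_in bands; the window size of 3 matches the 3DTM setting mentioned in Section 4.3, and standard MSA would then be applied over the resulting token sequences.

```python
import torch

def window_3d_tokens(x, M, bands):
    """View each M x M window of a (B, H, W, C) map as M*M*C_in units of dimension C/C_in (sketch)."""
    B, H, W, C = x.shape
    d = C // bands
    x = x.view(B, H // M, M, W // M, M, bands, d)
    # -> (num_windows, M*M*C_in, d); standard MSA is then computed among all of these units
    return x.permute(0, 1, 3, 2, 4, 5, 6).reshape(-1, M * M * bands, d)

units = window_3d_tokens(torch.rand(1, 48, 48, 102 * 9), M=3, bands=102)   # (256, 918, 9)
```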

The main differences between the calculation method of the 3D-MSA mechanism and that of the spatial and spectral-MSA mechanisms lie in two aspects (Figure 5(b)).

  1. The target of operation is different. In the spatial-MSA mechanism, attention is computed among the different pixels within a single window, while in the spectral-MSA mechanism, attention is computed among the different spectral encodings within a single pixel. Moreover, the 3D-MSA mechanism computes attention among all pixels and spectral encodings within a single window.

  2. The scale of operation is different. Spatial and spectral-MSA mechanisms use a sequential structure, first performing spectral-MSA and then spatial-MSA. In contrast, the 3D-MSA mechanism uses a holistic structure, considering different pixels and their spectral encodings as a whole when calculating self-attention. Thus, the spatial and spectral-MSA mechanisms cannot directly capture self-attention between the different pixels and spectral encodings.

Notably, regardless of the approach, the calculation of multihead self-attention follows the traditional transformer algorithm.

4. Experiments

In this section, we used three extensively utilized hyperspectral remote sensing datasets to assess the efficacy of our proposed method. Additionally, we provided the experimental parameter configurations and evaluation metrics. By conducting ablation experiments, we dissected the algorithm's strengths and limitations. Furthermore, we juxtaposed the experimental outcomes with those of state-of-the-art methods, supplemented by a thorough analysis.

4.1. Dataset and preprocessing

To validate the proposed method and compare it with other algorithms, we conducted comparative experiments on three commonly used hyperspectral datasets: Pavia Centre (Huang and Zhang Citation2009), Houston (Debes et al. Citation2014), and Chikusei (Yokoya and Iwasaki Citation2016). The Pavia Centre dataset was acquired by the ROSIS sensor during a flight over Pavia in northern Italy. The image has a ground sampling distance of 1.3 m, 102 spectral bands, and a size of 1096 × 1096 × 102. However, the scene includes some regions without information; after eliminating these no-information regions, the remaining effective region used in the experiment was 1096 × 700 × 102. The Houston dataset was captured over the University of Houston campus and adjacent urban areas, with 144 spectral bands in the 380-1050 nm wavelength range, a ground sampling distance of 2.5 m, and an image size of 349 × 1905 × 144. After removing the no-information edge areas, the remaining size was 1900 × 340 × 144. The Chikusei dataset was obtained by the Headwall Hyperspec-VNIR-C sensor over rural and urban areas of Chikusei, Ibaraki, Japan. The scene center is located at 36.294°N, 140.008°E. This dataset has 128 bands in the 363-1018 nm wavelength range and is composed of 2517 × 2335 pixels with a ground sampling distance of 2.5 m. Due to missing information at the image edges, the image was cropped to 2304 × 2048 pixels during use.

During the experimental process, 70% of the regions in the Pavia Centre and Chikusei datasets were designated as the training set, while the remaining 30% were used as the test set. The Houston dataset was divided into training, validation, and test sets at an 8:1:1 ratio. The training set data were cropped to 48 × 48 pixels and underwent random flipping or rotation during training. The validation and test set data were cropped to 128 × 128 pixels. Notably, for all three datasets, the HSIs obtained through cropping were treated as the ground truth data, and the low-resolution images were generated by downsampling these data with scale factors of 2, 4, and 8 during the experiments.
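The cropped HSIs serve as ground truth and the low-resolution inputs are obtained by downsampling them; the downsampling kernel is not specified in this excerpt, so bicubic is assumed in the sketch below.

```python
import torch
import torch.nn.functional as F

def make_lr(hr_patch, scale):
    """Generate the low-resolution input by downsampling a ground-truth patch (bicubic kernel assumed)."""
    return F.interpolate(hr_patch, scale_factor=1.0 / scale, mode='bicubic', align_corners=False)

hr = torch.rand(1, 144, 48, 48)      # a 48 x 48 Houston training patch with 144 bands
lr = make_lr(hr, scale=4)            # (1, 144, 12, 12)
```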

4.2. Parameter and evaluation indicators

  1. Parameter settings: Within the proposed SST approach, the model hyperparameters are specified in Table 1, and their values are determined based on the results of the ablation experiments (detailed in Section 4.3). Additionally, LeakyReLU is used as the activation function. The optimizer for the network is adaptive moment estimation (Adam) (Kingma and Ba Citation2014) with an initial learning rate of 0.0005. Due to the memory limitations of the graphics processor, the batch size during training is set to 4. All experiments are implemented using the PyTorch framework on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory; the training time is approximately 36 h.

  2. Evaluation indicators: We selected four widely used image quality assessment metrics: the mean peak signal-to-noise ratio (MPSNR) (Huynh-Thu and Ghanbari Citation2008), mean structural similarity (MSSIM) (Wang et al. Citation2004), spectral angle mapper (SAM) (Shuai et al. Citation2018), and cross-correlation (CC) (Loncan et al. Citation2015); these were used to evaluate the performance of our proposed algorithm for HSI super-resolution. MPSNR and MSSIM are important metrics for assessing the similarity and structural consistency between real and generated images; they provide a measure of spatial visual quality obtained by averaging over all spectral bands. A higher MPSNR value indicates better visual quality, while an MSSIM value of 1 represents optimal quality. SAM and CC are indicators of the spectral fidelity of images, with an optimal value of 0 for SAM and 1 for CC.
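For reference, the following sketch shows one common way to compute these band-averaged metrics on (C, H, W) arrays; it illustrates the standard definitions and is not the evaluation code used in the paper.

```python
import numpy as np

def mpsnr(gt, sr, data_range):
    """Mean PSNR over all bands; gt and sr are (C, H, W) arrays."""
    mse = np.mean((gt.astype(np.float64) - sr.astype(np.float64)) ** 2, axis=(1, 2))
    return float(np.mean(10.0 * np.log10(data_range ** 2 / np.maximum(mse, 1e-12))))

def sam(gt, sr, eps=1e-12):
    """Mean spectral angle (radians) between the per-pixel spectra of gt and sr."""
    g = gt.reshape(gt.shape[0], -1).astype(np.float64)
    s = sr.reshape(sr.shape[0], -1).astype(np.float64)
    cos = np.sum(g * s, axis=0) / (np.linalg.norm(g, axis=0) * np.linalg.norm(s, axis=0) + eps)
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))

def cc(gt, sr):
    """Mean per-band Pearson cross-correlation between gt and sr."""
    return float(np.mean([np.corrcoef(gt[b].ravel(), sr[b].ravel())[0, 1] for b in range(gt.shape[0])]))

# MSSIM can be obtained analogously by averaging a per-band SSIM, e.g.
# skimage.metrics.structural_similarity(gt[b], sr[b], data_range=data_range), over all bands.
```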

Table 1. The hyperparameter configuration of the SST model.

4.3. Ablation analysis

To assess the effectiveness of the proposed algorithm, a series of ablation experiments were conducted using the validation set from the Houston dataset.

First, we conducted three sets of control experiments using a scale factor of 2 to examine the advantages of each module in the integrated transformer block. In the spatial transformer model (STM), only the spatial transformer block was used to extract spatial features between HSI pixels, without considering the spectral features between bands; this is also the approach adopted in general image super-resolution tasks. In the spectral and spatial transformer concatenated model (SSTM), both spectral and spatial features were considered, and the two types of features were calculated by the two feature modules in series. The 3D transformer model (3DTM) simultaneously calculated the spectral and spatial features in a three-dimensional form. Notably, the window size was set to 3 × 3 in the 3DTM experiments due to GPU memory limitations. The experimental results are provided in Table 2.

Table 2. Experimental results from the different network structures on the Houston dataset (scale factor 2). The bolded items represent the best experimental results.

Table 2 shows that SSTM achieves the best overall performance, while STM shows the worst. After incorporating the calculation of the spectral features, the improvement in the spectral fidelity metrics, SAM and CC, demonstrates the importance of considering spectral features in hyperspectral remote sensing. The experimental results from the 3DTM are very close to those of the SSTM, which may be due to the window size being set to 3. Window size is also an important parameter that affects the experimental results. The analysis in Figure 6(b) shows that when the window size H > 6, the experimental performance of 3DTM surpasses that of SSTM. Since computational cost is also a crucial factor in assessing model performance, the subsequent comparative experiments use SSTM as the representative model for SST.

Figure 6. Experimental results from the 3DTM and SSTM under different parameters (on the Houston dataset, scaling factor: 4). (a) Multiple σ of the number of bands during embedding; (b) value H of the window size; (c) number K of residual groups; (d) number M of integrated transformer blocks in each residual block.


Using SSTM with a scaling factor of 4, we investigated the impact of various parameters on HSI super-resolution, including the multiple σ of the number of bands during embedding, the window size value H, the number K of residual groups, and the number M of integrated transformer blocks in each residual group. The experimental results are shown in Figure 7. For the multiple of the number of embedded bands (Figure 7(a)), the super-resolution effect gradually improved with increasing σ. The improvement in SAM and MPSNR was particularly significant in the range σ = 3 to 9 and diminished in the range σ = 9 to 15. These results indicate that changing the embedding dimension can improve the super-resolution effect of the model. However, an overly large σ increases the model's complexity and computational cost; therefore, σ was set to 9 in SST. Regarding the size of the segmentation window (Figure 7(b)), the model's performance increased with increasing window size H. After balancing computational cost and model performance, H was set to 8 in SST. For the number of residual groups (Figure 7(c)), the model's performance first increased and then decreased with increasing K. The inflection point occurred at K = 6, and the decline was potentially due to network degradation caused by excessive network depth or overfitting to the training set. Therefore, K was set to 6 in SST. Regarding the number of integrated transformer blocks (Figure 7(d)), the model's performance gradually improved with increasing M, but the improvement diminished when M > 6. To save computational cost, M was set to 6 in SST.

Figure 7. Experimental results from the SSTM under different parameters on the Houston dataset (scaling factor: 4). (a) Multiple σ of the number of bands during embedding; (b) value H of the window size; (c) number K of residual groups; (d) number M of integrated transformer blocks in each residual block.


4.4. Comparative experiments

In this section, we conduct comparative experiments using seven methods with scale factors of 2, 4, and 8 on the three datasets mentioned above. The seven methods are bicubic, the deep distillation recursive network (DDRN) (Jiang et al. Citation2018), the spatial-spectral prior network (SSPSR) (Jiang et al. Citation2020), Interactformer (Liu, Hu, et al. Citation2022), the Swin transformer for image restoration model (SwinIR) (Liang et al. Citation2021), the enhanced encoder-decoder generative adversarial network (EEGAN) (Jiang et al. Citation2019), and the dual self-attention Swin transformer SR (DSSTSR) (Long et al. Citation2023). Bicubic is a classical image processing method that is applied to each band of a remote sensing image to achieve HSI super-resolution. DDRN and SSPSR are CNN-based methods for HSI super-resolution. Notably, only scaling factors 4 and 8 are used in the comparison experiments with SSPSR due to the progressive upsampling method used in the SSPSR model. Interactformer, SwinIR, and DSSTSR are super-resolution models based on the transformer architecture, while EEGAN is a GAN-based HSI super-resolution method. Among the three variants STM, SSTM, and 3DTM, we select SSTM, which performed best (Table 2), for the HSI SR comparison experiments.

(1) Pavia Centre Dataset. Table 3 presents the evaluation results from the different super-resolution algorithms with various scaling factors on the Pavia Centre dataset. Our proposed SST method achieves the best performance at each scaling factor, while the bicubic interpolation method exhibits the worst results. These results indicate that deep learning methods are more effective than traditional interpolation methods. Among the learning-based algorithms, Interactformer, SwinIR, and DSSTSR, which use the transformer structure, and the GAN-based EEGAN algorithm achieve higher MPSNR and MSSIM values than the CNN-based DDRN and SSPSR. This result is consistent with the performance of related algorithms in other fields, such as classification (Dosovitskiy et al. Citation2021) and segmentation (Chen et al. Citation2021). This potentially occurs because the transformer and GAN models can extract more global features, leading to better model fitting ability.

Table 3. Quantitative comparison on the Pavia Centre dataset. The bolded items represent the best experimental results; the optimal standard deviation of experimental results and the SST algorithm's standard deviation are in parentheses (the same below).

Figures 8, 9, and 10 show the super-resolution results, errors, and standard deviations, respectively, for the Pavia Centre dataset at a scale factor of 4. These figures show that the SST algorithm produces good overall results, while the bicubic algorithm performs poorly in both the overall and detailed results. DDRN and SwinIR exhibit checkerboard artifacts, and the details of SSPSR and Interactformer, such as building boundaries, are slightly distorted. Notably, the models based on the transformer architecture not only produce similar experimental results but also exhibit excellent performance.

(2) Houston Dataset. Table 4 displays the evaluation results from the different super-resolution methods with varying scaling factors on the Houston dataset. As shown in the table, the performance of all algorithms gradually decreases as the scaling factor increases. However, SST outperforms the other models in all evaluation metrics. Notably, the SwinIR algorithm shows significantly inferior results in the SAM and CC metrics compared to the other learning-based algorithms. This occurs because SwinIR is not specifically designed for remote sensing image super-resolution and lacks the spectral feature extraction structure of HSI super-resolution algorithms.

Figure 8. Super-resolution results from the Pavia Centre dataset (scale factor: 4, red: 90, green: 60, blue: 30). (a) Ground truth; (b) Bicubic; (c) DDRN; (d) SSPSR; (e) Interactformer; (f) SwinIR; (g) EEGAN; (h) DSSTSR; (i) SST.


Figure 9. Error results between the real and reconstructed images in the Pavia Centre dataset (scale factor: 4). (a) Bicubic; (b) DDRN; (c) SSPSR; (d) Interactformer; (e) SwinIR; (f) EEGAN; (g) DSSTSR; (h) SST.


Figure 10. Standard deviation results between the real and reconstructed images in the Pavia Centre dataset (scale factor: 4). (a) Bicubic; (b) DDRN; (c) SSPSR; (d) Interactformer; (e) SwinIR; (f) EEGAN; (g) DSSTSR; (h) SST.


Table 4. Quantitative comparison on the Houston dataset. The bolded items represent the best experimental results.

Figure 11 displays the super-resolution results from the Houston dataset with a scale factor of 4. Evidently, among the algorithms tested in the experiment, the SST algorithm produces relatively better super-resolution results, as it effectively restores the edge features of the ground objects. The Interactformer and DSSTSR models, based on the transformer architecture, exhibit suboptimal results. Conversely, SwinIR's super-resolution images differ significantly from the ground truth in color, which may be attributed to the lack of a spectral feature extraction module in its model. However, all algorithms in the experiment exhibit some limitations in the restoration of details, such as the inability to restore the black blob on top of the building in the image. Based on the error image (Figure 12), the SST algorithm has fewer errors, as shown by the larger dark blue areas. The standard deviation image (Figure 13) also reflects the greater stability of the SST algorithm, which is particularly evident within the red box.

(3) Chikusei Dataset. Table 5 provides the evaluation results from the different super-resolution methods on the Chikusei dataset with different scaling factors. The values of each indicator in the table are superior to the results from the other two datasets. Specifically, each algorithm achieves higher MPSNR, MSSIM, and CC values and a lower SAM value. The reasons for these results may be as follows: (1) In terms of spatial features, the Chikusei dataset is mainly composed of rural areas, and the ground object types are relatively uniform, which reduces the difficulty of super-resolution. (2) In terms of spectral features, due to the high proportion of crops in the Chikusei dataset, the spectral response curves of the pixels are similar, which helps the model learn the spectral features and achieve higher spectral fidelity. Moreover, the SST algorithm performs better than the other algorithms on every evaluation metric, and the relative results of the algorithms are consistent with those on the other two datasets.

Figure 11. Super-resolution results from the Houston dataset (scale factor: 4, red: 120, green: 80, blue: 60). (a) Ground truth; (b) Bicubic; (c) DDRN; (d) SSPSR; (e) Interactformer; (f) SwinIR; (g) EEGAN; (h) DSSTSR; (i) SST.


Figure 12. Error results between the real and reconstructed images in the Houston dataset (scale factor: 4). (a) Bicubic; (b) DDRN; (c) SSPSR; (d) Interactformer; (e) SwinIR; (f) EEGAN; (g) DSSTSR; (h) SST.


Figure 13. Standard deviation results between the real and reconstructed images in the Houston dataset (scale factor: 4). (a) Bicubic; (b) DDRN; (c) SSPSR; (d) Interactformer; (e) SwinIR; (f) EEGAN; (g) DSSTSR; (h) SST.


Table 5. Quantitative comparison on the Chikusei dataset. The bolded items represent the best experimental results.

Figure 14 illustrates the super-resolution results from the Chikusei dataset with a scale factor of 4. Evidently, the performances of the different learning-based algorithms are relatively similar, indicating that the overall effect of each algorithm on the Chikusei dataset is excellent, as corroborated by the evaluation indicators in Table 5. However, in capturing details, the SST, SSPSR, Interactformer, DSSTSR, and EEGAN algorithms clearly outperform the other methods. On the other hand, bicubic and SwinIR perform comparatively poorly in terms of detail; this is particularly apparent in areas with substantial image changes, such as within the red box of the image and the error map (Figures 14 and 15).

Figure 14. Super-resolution results from the Chikusei dataset (scale factor: 4, red: 90, green: 60, blue: 30). (a) Ground truth; (b) Bicubic; (c) DDRN; (d) SSPSR; (e) Interactformer; (f) SwinIR; (g) EEGAN; (h) DSSTSR; (i) SST.


Figure 15. Error results between the real and reconstructed images in the Chikusei dataset (scale factor: 4). (a) Bicubic; (b) DDRN; (c) SSPSR; (d) Interactformer; (e) SwinIR; (f) EEGAN; (g) DSSTSR; (h) SST.


Figures 10, 13, and 16 show the standard deviation plots for the three datasets. These visuals clearly illustrate that, among the methods examined, SST has lower overall error and standard deviation. This observation highlights the superior precision and stability of the SST method. The disparities within the red-bordered regions in the plots are particularly evident. In Tables 3-5, the values in parentheses represent the standard deviation corresponding to each average metric. Based on the experimental results, the distribution of the best standard deviation appears to be almost random, suggesting that the stability of each comparative algorithm selected in the experiment is very similar. From another perspective, the standard deviation results for the Chikusei dataset are generally better than those for the other two datasets. This may also be attributed to the relatively smooth variations in the test data images of this dataset, allowing each method to achieve better results.

Figure 16. Standard deviation results between the real and reconstructed images in the Chikusei dataset (scale factor: 4). (a) Bicubic; (b) DDRN; (c) SSPSR; (d) Interactformer; (e) SwinIR; (f) EEGAN; (g) DSSTSR; (h) SST.

Figure 17 displays the band-wise absolute differences between the computed and actual values of a randomly selected pixel in the result images obtained for each of the three datasets with the various super-resolution algorithms. The value of each pixel represents the radiance intensity measured by the sensor at that location and is stored as a 16-bit unsigned integer, with a theoretical maximum of 65,535. The figure provides insight into the spectral fidelity of each super-resolution algorithm; the gray dotted line represents the SST algorithm. Evidently, across the three datasets, the curve of our algorithm lies below those of the other algorithms over most wavelength intervals, which is consistent with the SAM and CC evaluation indices.

Figure 17. Absolute value of the difference between the calculated value and the real value of a pixel. (a) Pavia dataset; (b) Houston dataset; (c) Chikusei dataset.

Notably, the uneven distribution of pixel values in the training data can significantly impact the model's accuracy. In Figure 17(a), an evident spike in the absolute differences is observed beyond a value of 70 on the x-axis; similarly, in Figure 17(c), this spike occurs beyond 110. Furthermore, in the error and standard deviation maps, areas with higher radiance intensity exhibit elevated error and standard deviation values compared with their surroundings. This occurs because regions with higher radiance intensity constitute a smaller proportion of the training data, so the model does not fully capture the relevant feature mappings. This pattern is observed across several of the comparative methods selected for the experiments. While the SST method does not completely resolve this issue, it does demonstrate a relative advantage, as shown by the red boxes in the error and standard deviation plots.
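For a single pixel, the band-wise absolute differences plotted in Figure 17, together with a spectral-angle value of the kind summarized by SAM, can be computed as in the following minimal sketch; the pixel coordinates and the (bands, H, W) array layout are assumptions made for illustration.

```python
import numpy as np

def pixel_band_abs_error(gt: np.ndarray, sr: np.ndarray, row: int, col: int) -> np.ndarray:
    """Absolute per-band difference at one pixel; cubes are (bands, H, W) uint16 arrays."""
    return np.abs(sr[:, row, col].astype(np.float64) - gt[:, row, col].astype(np.float64))

def spectral_angle(gt_vec: np.ndarray, sr_vec: np.ndarray) -> float:
    """Angle (radians) between the true and reconstructed spectra of one pixel."""
    a, b = gt_vec.astype(np.float64), sr_vec.astype(np.float64)
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return float(np.arccos(np.clip(cos_sim, -1.0, 1.0)))
```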

4.5. SR of the multispectral images in real scenarios

To assess the applicability of the proposed algorithm in practical scientific research and production, a series of validation experiments was conducted. Specifically, we aimed to investigate the long-term evolution of the wetland landscape within Ruoergai National Park in China, located at the northeastern edge of the Qinghai-Tibet Plateau, which required classifying a time series of remote sensing images over the study period. The initial intent was to use Landsat surface reflectance data for this purpose. However, because the Landsat 5 data lack a panchromatic band, panchromatic sharpening could not be performed, leaving images with different spatial resolutions across the study period. To ensure a uniform scale for the subsequent wetland landscape classification, the spatial resolution of the Landsat 5 data needed to be enhanced from 30 m to 15 m.

Specifically, we used the Google Earth Engine platform for data acquisition and preprocessing and generated a 30 m resolution remote sensing image of the Ruoergai National Park region, measuring 10,680 × 11,027 pixels, composed of six electromagnetic wave bands and acquired in 2001. The SSTM variant was chosen as the super-resolution model. For model training, paired image data before and after fusion acquired from 2018 to 2023 were used, and the 2001 image at 30 m resolution served as the test data. Notably, no ground-truth data were available for the test image; instead, the trained model was applied directly to generate an image of 21,360 × 22,054 pixels, upscaled by a factor of 2. Figure 18(b,c) illustrates local images within the experimental area before and after super-resolution, respectively. As evident from the images, the enhancement in spatial resolution after super-resolution is quite noticeable.
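An image of this size does not fit on the GPU in a single pass, so inference is typically carried out tile by tile and the tiles are stitched back together. The routine below is a generic tiled-inference sketch under assumed settings: the tile size, overlap, and `model` call signature are illustrative choices, not the exact configuration used in this experiment.

```python
import torch

@torch.no_grad()
def tiled_super_resolution(model, image, scale=2, tile=256, overlap=16, device="cuda"):
    """Apply a trained SR model to a large (C, H, W) float tensor tile by tile.

    Overlapping tiles are averaged in the output to suppress visible seams.
    Tile size and overlap are illustrative, not the paper's settings.
    """
    model = model.to(device).eval()
    c, h, w = image.shape
    out = torch.zeros(c, h * scale, w * scale)
    weight = torch.zeros_like(out)
    step = tile - overlap
    for top in range(0, h, step):
        for left in range(0, w, step):
            bottom, right = min(top + tile, h), min(left + tile, w)
            patch = image[:, top:bottom, left:right].unsqueeze(0).to(device)
            sr = model(patch).squeeze(0).cpu()          # (C, tile_h * scale, tile_w * scale)
            t, l = top * scale, left * scale
            out[:, t:t + sr.shape[1], l:l + sr.shape[2]] += sr
            weight[:, t:t + sr.shape[1], l:l + sr.shape[2]] += 1
    return out / weight.clamp(min=1)
```

Averaging overlapping tiles is one common way to avoid seam artifacts; other blending schemes (e.g. weighted windows) would serve equally well.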

Figure 18. Real-world multispectral image super-resolution and classification results. (a) Wetland landscape classification results in Ruoergai National Park; (b) local area low-resolution image (30 m, true color); (c) local area high-resolution image (15 m, true color); (d) landscape classification results in the local area.

Subsequently, we employed a self-supervised approach (Hamilton et al. Citation2022) to perform pixel-based wetland landscape classification on the super-resolved images. The categories included river-lake, marsh, swamp meadow, meadow, bare sand, and building-road. Figure 18(a) presents the classification results for the entire experimental area, while Table 6 lists the precision, recall, and F1 scores for each category. The results indicate that the classification performance of the images processed with the SST algorithm is excellent, demonstrating the algorithm's strong spectral fidelity and supporting its applicability to subsequent remote sensing interpretation tasks.

Table 6. The results from the wetland landscape classification in Ruoergai National Park using SR images.
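The per-class precision, recall, and F1 scores reported in Table 6 can be derived from the predicted and reference label maps in the standard way. A minimal sketch using scikit-learn is shown below; the class list mirrors the categories named above, and the flattened per-pixel label arrays are assumptions of this illustration rather than the exact evaluation code used in the study.

```python
from sklearn.metrics import classification_report

CLASSES = ["river-lake", "marsh", "swamp meadow", "meadow", "bare sand", "building-road"]

def per_class_scores(y_true, y_pred) -> str:
    """Precision, recall, and F1 per class from flattened per-pixel label arrays (indices 0-5)."""
    return classification_report(y_true, y_pred, target_names=CLASSES, digits=3)
```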

4.6. Computational cost

In this subsection, we compare the computational cost of all models using two indicators: the total number of parameters and the computation time. The experiments were conducted on the Chikusei dataset with a scale factor of 4, and all models were implemented in PyTorch on an NVIDIA GeForce RTX 3090 GPU. Table 7 reports the total number of parameters, the computation time, and the corresponding MPSNR value for each algorithm. As the table shows, the transformer-based super-resolution algorithms, such as SST, SwinIR, and Interactformer, did not have an advantage in computation time. In contrast, the CNN-based algorithms were faster, but their super-resolution results were not as good as those of the transformer-based algorithms. Therefore, achieving both accuracy and efficiency remains a challenge for the current algorithm and will be addressed in future work.

Table 7. Comparison of the computational cost and super-resolution effect on the Chikusei dataset (scale factor 4).
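Both indicators can be measured with a few lines of PyTorch. The snippet below is a sketch under assumed input dimensions and warm-up settings, not the exact benchmarking protocol behind Table 7.

```python
import time
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Total number of trainable parameters in the model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def time_inference(model, bands=128, size=64, runs=10, device="cuda"):
    """Average forward-pass time on a dummy low-resolution cube (sizes are illustrative)."""
    model = model.to(device).eval()
    x = torch.randn(1, bands, size, size, device=device)
    model(x)                      # warm-up pass to exclude one-off initialization costs
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return (time.time() - start) / runs
```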

5. Discussion

5.1. Cases using SSTM and 3DTM

SST utilizes two methods for the concurrent extraction of spectral and spatial features: SSTM and 3DTM. Since both serve the same purpose, they are interchangeable in application. However, because of their distinct structural designs, the experimental outcomes differ. In our experiments, we observed that when the window size was set to a larger value, 3DTM outperformed SSTM, but at the cost of higher computational resource requirements. Hence, in practical applications, the choice between the two should depend on the scenario: if ample computational resources are available and the objective is superior super-resolution performance, the 3D-MSA structure may be the preferred choice, whereas if fast super-resolution results are desired, the spectral-MSA and spatial-MSA structures can also yield acceptable outcomes. A rough comparison of their attention costs is sketched below.
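To make the trade-off concrete, the following sketch estimates the number of query-key pairs attended to per window under a factorized spectral/spatial scheme versus a joint 3D scheme. The exact attention layouts of SSTM and 3DTM may differ from this simplification, so treat it purely as a back-of-the-envelope comparison under the stated assumptions.

```python
def attention_pairs(window: int, bands: int):
    """Rough count of query-key pairs per window.

    Assumes spatial attention over window*window tokens, spectral attention over
    `bands` tokens, and 3D attention over window*window*bands tokens. This is a
    simplification used only to show how the costs scale, not the exact SSTM/3DTM design.
    """
    spatial = (window * window) ** 2
    spectral = bands ** 2
    factorized = spatial + spectral           # spectral-MSA + spatial-MSA, applied separately
    full_3d = (window * window * bands) ** 2  # joint 3D-MSA over all tokens in the window
    return factorized, full_3d

# Example: an 8x8 window with 128 bands
print(attention_pairs(8, 128))  # -> (20480, 67108864): the 3D window attends to far more pairs
```

This scaling gap grows quickly with window size, which is consistent with 3DTM's higher resource requirements at larger windows.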

5.2. Application potential of SST

In this study, a transformer-based remote sensing image super-resolution algorithm was introduced and shown to outperform other methods in comparative experiments. Additionally, the algorithm's practicality was demonstrated in the subsequent large-scale remote sensing classification experiments. In terms of computational cost, when using the SSTM structure, the SST algorithm took approximately 15 min to perform a super-resolution task with a scale factor of 2 on an image of 10,680 × 11,027 pixels using an RTX 3090 GPU. While this is not the fastest option, it is adequate for practical research and production tasks. The proposed algorithm has practical value in enhancing the quality of remote sensing images while preserving spectral fidelity, and it therefore holds considerable potential for applications in diverse fields, including image interpretation, disaster monitoring, environmental protection, and more.

5.3. Inspiration for future research

The success of transformers in natural language processing has led researchers to recognize their potential for feature extraction, although they require customized improvements for different tasks. Our proposed SST method demonstrates that transformers can be adapted to remote sensing image super-resolution. The key lies in designing appropriate computational modules for extracting spectral-spatial features, so that excellent results are achieved in both spatial resolution and spectral fidelity; this is the crucial factor behind the variation in outcomes among different methods. Therefore, one direction for future research is to make reasonable enhancements to algorithms from other fields and apply them to remote sensing super-resolution tasks; another is to delve deeper into the spectral-spatial structure, elucidate its computational principles, and make targeted improvements.

5.4. Limitations and directions for improvements

While SST achieved favorable experimental results and can be used for real-world remote sensing image super-resolution tasks, it also has notable limitations. First, it shares the high computational requirements common to transformer-based structures, and this issue is more pronounced in the 3DTM structure. Reducing the algorithm's computational demands is therefore one of the key areas for future research, which may involve modifying the transformer structure or redesigning the network architecture. Furthermore, in terms of model stability, SST shows no significant improvement over the other benchmark algorithms; they generally maintain a similar level. This underscores a shared challenge that algorithms of this type need to address collectively. Lastly, the effectiveness of super-resolution at large scale factors (e.g. 8, 16, and 32) remains limited, because a single low-resolution image provides insufficient information, rendering SST less useful in such cases. Overcoming this challenge is also a focus of future work.

6. Conclusions

To address the super-resolution problem of HSIs, an innovative super-resolution algorithm, SST, which integrates both spatial and spectral transformers, was presented in this study. The spatial-transformer and spectral-transformer architectures were used to extract the spatial and spectral features of the input images, respectively, and these features were then integrated for HSI super-resolution. Furthermore, a 3D transformer structure was devised to extract spatial and spectral features concurrently. Based on experiments and analysis, the serial structure of the spectral and spatial transformer was adopted as the representative SST method. Comparative experiments on three publicly available HSI datasets, evaluated with multiple metrics and super-resolution result images, showed that the transformer-based SST algorithm reconstructed images more effectively than the compared algorithms. The experimental results also highlighted the significant role of the spectral-MSA block in enabling the model to extract crucial spectral features. Furthermore, the proposed algorithm was applied to real-world super-resolution experiments in the region of China's Ruoergai National Park, and pixel-based classification was subsequently conducted on the experimental outcomes; these results showed that SST excels in both preserving spectral fidelity and achieving spatial restoration, demonstrating its potential for future remote sensing interpretation tasks. However, the computational cost analysis showed that SST does not have an advantage in computation time; improving efficiency is one of the main directions for future work on the proposed super-resolution algorithm.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The datasets that have been utilized to support the research in this paper, namely, Pavia Centre, Houston, and Chikusei, are available for download from https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Pavia_University_scene, http://www.grss-ieee.org/community/technical-committees/data-fusion/2013-ieee-grss-data-fusion-contest, and https://naotoyokoya.com/Download.html, respectively. Additionally, the self-constructed dataset for Ruoergai National Park has already been uploaded to https://gitee.com/a_small_tree_of_joy/sst.

Additional information

Funding

This study was supported by the Project of the Sichuan Science and Technology Program [grant number 2023YFS0499], the National Major Science and Technology Projects of China for the High-resolution Earth Observation System [grant number 87-Y50G28-9001-22/23], and the Aba Science and Technology Projects of Sichuan Province [grant number R23YYJSYJ0006].

References

  • Arefin, M. R., V. Michalski, P. L. St-Charles, A. Kalaitzis, S. Kim, S. E. Kahou, and Y. Bengio. 2020. “Multi-image Super-Resolution for Remote Sensing Using Deep Recurrent Networks.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, USA, 206–207.
  • Bedini, E. 2017. “The use of Hyperspectral Remote Sensing for Mineral Exploration: A Review.” Journal of Hyperspectral Remote Sensing 7 (4): 189–211. https://doi.org/10.29150/jhrs.v7.4.p189-211
  • Carion, N., F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. 2020. “End-to-end Object Detection with Transformers.” In European Conference on Computer Vision (ECCV), Glasgow, UK, 213–229.
  • Chai, X., Y. Wang, X. Chen, Z. Gan, and Y. Zhang. 2022. “TPE-GAN: Thumbnail Preserving Encryption Based on GAN with Key.” IEEE Signal Processing Letters 29: 972–976. https://doi.org/10.1109/LSP.2022.3163685
  • Chen, Z., Y. Cao, J. Su, and J. Lu. 2021. “TransUnet: Transformers Make Strong Encoders for Medical Image Segmentation.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14827–14836. Online.
  • Chen, H., X. He, L. Qing, Y. Wu, C. Ren, R. E. Sheriff, and C. Zhu. 2022. “Real-World Single Image Super-Resolution: A Brief Review.” Information Fusion 79: 124–145. https://doi.org/10.1016/j.inffus.2021.09.005
  • Chen, Y., L. Xu, Y. Fang, J. Peng, W. Yang, A. Wong, and D. A. Clausi. 2020. “Unsupervised Bayesian Subpixel Mapping of Hyperspectral Imagery Based on Band-Weighted Discrete Spectral Mixture Model and Markov Random Field.” IEEE Geoscience and Remote Sensing Letters 18 (1): 162–166. https://doi.org/10.1109/LGRS.2020.2967104
  • Chen, W., X. Zheng, and X. Lu. 2021. “Hyperspectral Image Super-Resolution with Self-Supervised Spectral-Spatial Residual Network.” Remote Sensing 13 (7): 1260. https://doi.org/10.3390/rs13071260
  • Debes, C., A. Merentitis, R. Heremans, J. Hahn, N. Frangiadakis, T. van Kasteren, Liao Wenzhi, Bellens Rik, Pizurica Aleksandra, Gautama Sidharta, Philips Wilfried, Prasad Saurabh, Du Qian, F. Pacifici. 2014. “Hyperspectral and LiDAR Data Fusion: Outcome of the 2013 GRSS Data Fusion Contest.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (6): 2405–2418. https://doi.org/10.1109/JSTARS.2014.2305441
  • Deepa, F. R., and M. J. Islam. 2020. “Effect of Atmospheric Turbulence on the Performance of Underwater Wireless SAC-OCDMA System.” In 2020 2nd International Conference on Sustainable Technologies for Industry 4.0 (STI), 1–5. Online.
  • Dian, R., S. Li, B. Sun, and A. Guo. 2021. “Recent Advances and New Guidelines on Hyperspectral and Multispectral Image Fusion.” Information Fusion 69: 40–51. https://doi.org/10.1016/j.inffus.2020.11.001
  • Ding, H. Y., and Z. F. Bian. 2008. “Remote Sensed Image Super-Resolution Reconstruction Based on a BP Neural Network.” Computer Engineering and Applications 44 (1): 171–172.
  • Dong, L., S. Huang, F. Wei, M. Lapata, M. Zhou, and K. Xu. 2017. “Learning to Generate Product Reviews from Attributes.” In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 623–632.
  • Dong, C., C. C. Loy, K. He, and X. Tang. 2015. “Image Super-Resolution Using Deep Convolutional Networks.” IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2): 295–307. https://doi.org/10.1109/TPAMI.2015.2439281
  • Dosovitskiy, A., L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, and N. Houlsby. 2021. “An Image is Worth 16×16 words: Transformers for Image Recognition at Scale.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10687–10698.
  • Fu, Y., Z. Liang, and S. You. 2021. “Bidirectional 3D Quasi-Recurrent Neural Network for Hyperspectral Image Super-Resolution.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14: 2674–2688. https://doi.org/10.1109/JSTARS.2021.3057936
  • Gao, Y., W. Li, M. Zhang, J. Wang, W. Sun, R. Tao, and Q. Du. 2021. “Hyperspectral and Multispectral Classification for Coastal Wetland Using Depthwise Feature Interaction Network.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–15.
  • Gong, L., J. M. Crego, and J. Senellart. 2019. “Enhanced Transformer Model for Data-to-Text Generation.” In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, People’s Republic of China, 148–156.
  • Gong, Z., N. Wang, D. Cheng, X. Jiang, J. Xin, X. Yang, and X. Gao. 2022. “Learning Deep Resonant Prior for Hyperspectral Image Super-Resolution.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–14.
  • Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. 2014. “Generative Adversarial Networks.” In Advances in Neural Information Processing Systems, Montreal, Canada, 2672–2680.
  • Govil, H., G. Mishra, N. Gill, A. Taloor, and P. Diwan. 2021. “Mapping Hydrothermally Altered Minerals and Gossans Using Hyperspectral Data in Eastern Kumaon Himalaya, India.” Applied Computing and Geosciences 9: 100054. https://doi.org/10.1016/j.acags.2021.100054
  • Hamilton, M., Z. Zhang, B. Hariharan, N. Snavely, and W. T. Freeman. 2022. “Unsupervised Semantic Segmentation by Distilling Feature Correspondences.” arXiv preprint arXiv:2203.08414.
  • Han, S. Y., N. H. Park, and K. H. Joo. 2015. “Wavelet Transform Based Image Interpolation for Remote Sensing Image.” International Journal of Software Engineering and Its Applications 9 (2): 59–66.
  • He, J., Q. Yuan, J. Li, Y. Xiao, X. Liu, and Y. Zou. 2022. “DsTer: A Dense Spectral Transformer for Remote Sensing Spectral Super-Resolution.” International Journal of Applied Earth Observation and Geoinformation 109: 102773. https://doi.org/10.1016/j.jag.2022.102773
  • He, K., X. Zhang, S. Ren, and J. Sun. 2016. “Deep Residual Learning for Image Recognition.” In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 770–778.
  • Hochreiter, S., and J. Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8): 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
  • Huang, X., and L. Zhang. 2009. “A Comparative Study of Spatial Approaches for Urban Mapping Using Hyperspectral ROSIS Images Over Pavia City, Northern Italy.” International Journal of Remote Sensing 30 (12): 3205–3221. https://doi.org/10.1080/01431160802559046
  • Huynh-Thu, Q., and M. Ghanbari. 2008. “Scope of Validity of PSNR in Image/Video Quality Assessment.” Electronics Letters 44 (13): 800–801. https://doi.org/10.1049/el:20080522
  • Irmak, H., G. B. Akar, S. E. Yuksel, and H. Aytaylan. 2016. “Super-resolution Reconstruction of Hyperspectral Images via an Improved map-Based Approach.” In 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, People’s Republic of China, 7244–7247.
  • Jia, J., L. Ji, Y. Zhao, and X. Geng. 2018. “Hyperspectral Image Super-Resolution with Spectral–Spatial Network.” International Journal of Remote Sensing 39 (22): 7806–7829. https://doi.org/10.1080/01431161.2018.1471546
  • Jiang, X., Y. Li, T. Jiang, J. Xie, Y. Wu, Q. Cai, J. Jinhui, X. Jiaming, H. Zhang. 2022. “RoadFormer: Pyramidal Deformable Vision Transformers for Road Network Extraction with Remote Sensing Images.” International Journal of Applied Earth Observation and Geoinformation 113: 102987. https://doi.org/10.1016/j.jag.2022.102987
  • Jiang, J., H. Sun, X. Liu, and J. Ma. 2020. “Learning Spatial-Spectral Prior for Super-Resolution of Hyperspectral Imagery.” IEEE Transactions on Computational Imaging 6: 1082–1096. https://doi.org/10.1109/TCI.2020.2996075
  • Jiang, J., C. Wang, X. Liu, K. Jiang, and J. Ma. 2022. “From Less to More: Spectral Splitting and Aggregation Network for Hyperspectral Face Super-Resolution.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 267–276.
  • Jiang, K., Z. Wang, P. Yi, J. Jiang, J. Xiao, and Y. Yao. 2018. “Deep Distillation Recursive Network for Remote Sensing Imagery Super-Resolution.” Remote Sensing 10 (11): 1700. https://doi.org/10.3390/rs10111700
  • Jiang, K., Z. Wang, P. Yi, G. Wang, T. Lu, and J. Jiang. 2019. “Edge-enhanced GAN for Remote Sensing Image Superresolution.” IEEE Transactions on Geoscience and Remote Sensing 57 (8): 5799–5812. https://doi.org/10.1109/TGRS.2019.2902431
  • Wang, Jingmeng, Aiwu Zhang, Xiangang Meng, and Zhao Liu. 2015. “Super-resolution Reconstruction of Remote Sensing Image Based on Staggered Pixels and non-Uniform B-Spline Curved Surface.” Remote Sensing for Natural Resources 27 (1): 35–43.
  • Kingma, D. P., and J. Ba. 2014. “Adam: A Method for Stochastic Optimization.” arXiv preprint arXiv:1412.6980.
  • LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner. 1998. “Gradient-based Learning Applied to Document Recognition.” Proceedings of the IEEE 86 (11): 2278–2324. https://doi.org/10.1109/5.726791
  • Lei, S., Z. Shi, and W. Mo. 2021. “Transformer-based Multistage Enhancement for Remote Sensing Image Super-Resolution.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–11.
  • Lei, S., Z. Shi, and Z. Zou. 2017. “Super-resolution for Remote Sensing Images via Local–Global Combined Network.” IEEE Geoscience and Remote Sensing Letters 14 (8): 1243–1247. https://doi.org/10.1109/LGRS.2017.2704122
  • Li, S., R. Dian, L. Fang, and J. M. Bioucas-Dias. 2018. “Fusing Hyperspectral and Multispectral Images via Coupled Sparse Tensor Factorization.” IEEE Transactions on Image Processing 27 (8): 4118–4130. https://doi.org/10.1109/TIP.2018.2836307
  • Li, X., H. Shen, L. Zhang, H. Zhang, Q. Yuan, and G. Yang. 2014. “Recovering Quantitative Remote Sensing Products Contaminated by Thick Clouds and Shadows Using Multitemporal Dictionary Learning.” IEEE Transactions on Geoscience and Remote Sensing 52 (11): 7086–7098. https://doi.org/10.1109/TGRS.2014.2307354
  • Li, Q., Q. Wang, and X. Li. 2021. “Exploring the Relationship Between 2D/3D Convolution for Hyperspectral Image Super-Resolution.” IEEE Transactions on Geoscience and Remote Sensing 59 (10): 8693–8703. https://doi.org/10.1109/TGRS.2020.3047363
  • Li, K., S. Yang, R. Dong, X. Wang, and J. Huang. 2020. “Survey of Single Image Super-Resolution Reconstruction.” IET Image Processing 14 (11): 2273–2290. https://doi.org/10.1049/iet-ipr.2019.1438
  • Liang, J., J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte. 2021. “Swinir: Image Restoration Using Swin Transformer.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1833–1844. Online.
  • Liu, X., T. Feng, X. Shen, and R. Li. 2022. “PMDRnet: A Progressive Multiscale Deformable Residual Network for Multi-Image Super-Resolution of AMSR2 Arctic Sea Ice Images.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–18.
  • Liu, Y., J. Hu, X. Kang, J. Luo, and S. Fan. 2022. “Interactformer: Interactive Transformer and CNN for Hyperspectral Image Super-Resolution.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–15.
  • Liu, Z., Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. 2021. “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows.” arXiv preprint arXiv:2103.14030.
  • Liu, Y., Y. Wang, N. Li, X. Cheng, Y. Zhang, Y. Huang, and G. Lu. 2018. “An Attention-Based Approach for Single Image Super Resolution.” In 2018 24th International Conference on Pattern Recognition (ICPR), edited by D. Lopresti and R. He, 2777–2784. Beijing, People’s Republic of China: IEEE.
  • Loncan, L., L. B. De Almeida, J. M. Bioucas-Dias, X. Briottet, J. Chanussot, N. Dobigeon, Sophie Fabre, et al. 2015. “Hyperspectral Pansharpening: A Review.” IEEE Geoscience and Remote Sensing Magazine 3 (3): 27–46. https://doi.org/10.1109/MGRS.2015.2440094.
  • Long, Y., X. Wang, M. Xu, S. Zhang, S. Jiang, and S. Jia. 2023. “Dual Self-Attention Swin Transformer for Hyperspectral Image Super-Resolution.” IEEE Transactions on Geoscience and Remote Sensing 61: 1–12.
  • Lu, Z., J. Li, H. Liu, C. Huang, L. Zhang, and T. Zeng. 2022. “Transformer for Single Image Super-Resolution.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 457–466. New Orleans, USA.
  • Ma, Q., J. Jiang, X. Liu, and J. Ma. 2023. “Learning a 3D-CNN and Transformer Prior for Hyperspectral Image Super-Resolution.” Information Fusion 100: 101907. https://doi.org/10.1016/j.inffus.2023.101907
  • Navin, M. S., and L. Agilandeeswari. 2020. “Multispectral and Hyperspectral Images Based Land use / Land Cover Change Prediction Analysis: An Extensive Review.” Multimedia Tools and Applications 79 (39-40): 29751–29774. https://doi.org/10.1007/s11042-020-09531-z.
  • Nguyen, C., V. Sagan, M. Maimaitiyiming, M. Maimaitijiang, S. Bhadra, and M. T. Kwasniewski. 2021. “Early Detection of Plant Viral Disease Using Hyperspectral Imaging and Deep Learning.” Sensors 21 (3): 742. https://doi.org/10.3390/s21030742
  • Niu, C., K. Tan, X. Jia, and X. Wang. 2021. “Deep Learning Based Regression for Optically Inactive Inland Water Quality Parameter Estimation Using Airborne Hyperspectral Imagery.” Environmental Pollution 286: 117534. https://doi.org/10.1016/j.envpol.2021.117534
  • Raghu, M., C. Zhang, J. Kleinberg, and S. Bengio. 2021. “Transfusion: Understanding Transfer Learning for Medical Imaging.” In International Conference on Machine Learning (ICML), 8102–8113. Online.
  • Shi, Y., L. Han, L. Han, S. Chang, T. Hu, and D. Dancey. 2022. “A Latent Encoder Coupled Generative Adversarial Network (le-gan) for Efficient Hyperspectral Image Super-Resolution.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–19.
  • Shuai, Y., Y. Wang, Y. Peng, and Y. Xia. 2018. “Accurate Image Super-Resolution Using Cascaded Multi-Column Convolutional Neural Networks.” In 2018 IEEE International Conference on Multimedia and Expo (ICME), 1–6. San Diego, USA.
  • Solanki, P., D. Israni, and A. Shah. 2018. “An Efficient Satellite Image Super Resolution Technique for Shift-Variant Images Using Improved new Edge Directed Interpolation.” Statistics, Optimization & Information Computing 6 (4): 619–632. https://doi.org/10.19139/soic.v6i4.433
  • Stuart, M. B., A. J. McGonigle, and J. R. Willmott. 2019. “Hyperspectral Imaging in Environmental Monitoring: A Review of Recent Developments and Technological Advances in Compact Field Deployable Systems.” Sensors 19 (14): 3071. https://doi.org/10.3390/s19143071
  • Tao, L., Q. Feng, and Z. Bao. 2018. “MAP Super-Resolution Reconstruction of Remote Sensing Image.” Chinese Journal of Liquid Crystals and Displays 33 (10): 884–892. https://doi.org/10.3788/YJYXS20183310.0884
  • Tao, H., X. Tang, J. Liu, and J. Tian. 2003. “Superresolution Remote Sensing Image Processing Algorithm Based on Wavelet Transform and Interpolation.” In Vol. 4898 of Image Processing and Pattern Recognition in Remote Sensing, 259–263. SPIE.
  • Touvron, H., A. Vedaldi, M. Douze, and H. Jégou. 2021. “Training Data-Efficient Image Transformers & Distillation Through Attention.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1781–1790. Online.
  • Tu, J., G. Mei, Z. Ma, and F. Piccialli. 2022. “SWCGAN: Generative Adversarial Network Combining Swin Transformer and CNN for Remote Sensing Image Super-Resolution.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 15: 5662–5673. https://doi.org/10.1109/JSTARS.2022.3190322
  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. “Attention is all you Need.” Advances in Neural Information Processing Systems 30: 5998–6008.
  • Wang, Z., A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. 2004. “Image Quality Assessment: From Error Visibility to Structural Similarity.” IEEE Transactions on Image Processing 13 (4): 600–612. https://doi.org/10.1109/TIP.2003.819861
  • Wang, J., W. Li, M. Zhang, and J. Chanussot. 2023b. “Large Kernel Sparse ConvNet Weighted by Multi-Frequency Attention for Remote Sensing Scene Understanding.” IEEE Transactions on Geoscience and Remote Sensing 61: 1–12.
  • Wang, J., W. Li, M. Zhang, R. Tao, and J. Chanussot. 2023a. “Remote Sensing Scene Classification Via Multi-Stage Self-Guided Separation Network.” IEEE Transactions on Geoscience and Remote Sensing 61: 1–12.
  • Wang, W., L. Liu, T. Zhang, J. Shen, J. Wang, and J. Li. 2022. “Hyper-ES2T: Efficient Spatial–Spectral Transformer for the Classification of Hyperspectral Remote Sensing Images.” International Journal of Applied Earth Observation and Geoinformation 113: 103005. https://doi.org/10.1016/j.jag.2022.103005
  • Weber, C., R. Aguejdad, X. Briottet, J. Avala, S. Fabre, J. Demuynck, … N. Chehata. 2018. “Hyperspectral Imagery for Environmental Urban Planning.” In IGARSS 2018 IEEE International Geoscience and Remote Sensing Symposium, 1628–1631. Valencia, Spain.
  • Wen, Chen, and Wang Yuan-fei. 2011. “A Study on A New Method of Multi-Spatial-Resolution Remote Sensing Image Fusion Based on GA-BP.” Remote Sensing Technology and Application 22 (4): 555–559.
  • Xinlei, W., and L. Naifeng. 2016. “Super-resolution of Remote Sensing Images via Sparse Structural Manifold Embedding.” Neurocomputing 173: 1402–1411. https://doi.org/10.1016/j.neucom.2015.09.012
  • Yokoya, N., and A. Iwasaki. 2016. Airborne Hyperspectral Data Over Chikusei. Space Appl. Lab., Univ. Tokyo, Japan, Tech. Rep. SAL-2016-05-27, 5.
  • Zhang, T., Y. Du, and F. Lu. 2017. “Super-resolution Reconstruction of Remote Sensing Images Using Multiple-Point Statistics and Isometric Mapping.” Remote Sensing 9 (7): 724. https://doi.org/10.3390/rs9070724
  • Zhang, S., G. Fu, H. Wang, and Y. Zhao. 2021. “Degradation Learning for Unsupervised Hyperspectral Image Super-Resolution Based on Generative Adversarial Network.” Signal, Image and Video Processing 15 (8): 1695–1703. https://doi.org/10.1007/s11760-021-01902-9
  • Zhang, M., W. Li, X. Zhao, H. Liu, R. Tao, and Q. Du. 2023. “Morphological Transformation and Spatial-Logical Aggregation for Tree Species Classification Using Hyperspectral Imagery.” IEEE Transactions on Geoscience and Remote Sensing 61: 1–12.
  • Zhang, X., Y. Sun, K. Shang, L. Zhang, and S. Wang. 2016. “Crop Classification Based on Feature Band set Construction and Object-Oriented Approach Using Hyperspectral Images.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9 (9): 4117–4128. https://doi.org/10.1109/JSTARS.2016.2577339
  • Zhang, K., B. Wang, W. Zuo, H. Zhang, and L. Zhang. 2015. “Joint Learning of Multiple Regressors for Single Image Super-Resolution.” IEEE Signal Processing Letters 23 (1): 102–106. https://doi.org/10.1109/LSP.2015.2504121
  • Zhu, Q., W. Deng, Z. Zheng, Y. Zhong, Q. Guan, W. Lin, L. Zhang, and D. Li. 2021. “A Spectral-Spatial-Dependent Global Learning Framework for Insufficient and Imbalanced Hyperspectral Image Classification.” IEEE Transactions on Cybernetics 52 (11): 11709–11723. https://doi.org/10.1109/TCYB.2021.3070577.
  • Zhu, J., J. Li, M. Zhu, L. Qian, M. Zhang, and G. Zhou. 2019. “Modeling Graph Structure in Transformer for Better AMR-To-Text Generation.” arXiv preprint arXiv:1909.00136.
  • Zhu, Q., L. Wang, J. Chen, W. Zeng, Y. Zhong, Q. Guan, and Z. Yang. 2021. “S³TRM: Spectral-Spatial Unmixing of Hyperspectral Imagery Based on Sparse Topic Relaxation-Clustering Model.” IEEE Transactions on Geoscience and Remote Sensing 60: 1–13.