Research Article

Spatial-spectral hierarchical vision permutator for hyperspectral image classification

Article: 2153747 | Received 30 May 2022, Accepted 28 Nov 2022, Published online: 08 Dec 2022

ABSTRACT

In recent years, the convolutional neural network (CNN) has been widely applied to hyperspectral image classification because of its powerful feature capture ability. Nevertheless, the performance of most convolutional operations is limited by the fixed shape and size of the convolutional kernel, which prevents CNNs from fully extracting global features. To address this issue, this article proposes a novel classification architecture named spatial-spectral hierarchical Vision Permutator (S2HViP). It contains a spectral module and a spatial module. In the spectral module, we divide the data into groups along the spectral dimension and treat each pixel within a group as a spectral token. Spectral long-range dependencies are obtained by fusing intra- and inter-group spectral correlations captured by multi-layer perceptrons (MLPs). In the spatial module, we first model spatial information via morphological methods and divide the resulting spatial feature maps into spatial tokens of uniform size. Then, the global spatial information is extracted through MLPs. Finally, the extracted spectral and spatial features are combined for classification. Notably, the proposed MLP structure is an improved Vision Permutator that adopts a hierarchical fusion strategy aimed at generating discriminative features. Experimental results show that S2HViP provides competitive performance compared to several state-of-the-art methods.

Introduction

Hyperspectral image (HSI) contains hundreds of narrow and contiguous spectral bands, with wavelengths spanning the visible to the infrared spectrum (Chang, 2007). Because of its rich spectral and spatial information, HSI has been widely used in various fields, such as target detection, precision agriculture, mineral exploration, and image fusion (Carrino et al., 2018; Imani & Ghassemian, 2020; Shen et al., 2022; Wei et al., 2019). HSI classification, which aims to assign each pixel's spectrum to a certain class, is one of the most vibrant tasks in remote sensing applications and has received a great deal of attention (Bioucas-Dias et al., 2013; Camps-Valls et al., 2013; Liu et al., 2020). In the early stages of research, most methods used exclusively the spectral characteristics of a pixel to complete the classification task. Some of these methods focused on feature extraction or dimensionality reduction, such as principal component analysis (PCA; Prasad & Bruce, 2008), independent component analysis (Tu, 2000), and linear discriminant analysis (Bandos et al., 2009). Other pixel-wise approaches were designed as classifiers, e.g. support vector machines (SVM; Melgani & Bruzzone, 2004), multinomial logistic regression (Pal, 2012), and random forest (Ham et al., 2005).

The problem with using spectral information alone is not only the lack of spatial information, but also that spectral data are affected by incident illumination, atmospheric effects and instrument noise. These factors greatly limit the accuracy of pixel-level classification methods (Li, Song et al., 2019). To compensate for the inadequacy of spectral information alone, some researchers proposed new methods that obtain spatial features to supplement the spectral information (Benediktsson et al., 2005; Camps-Valls et al., 2006; Fauvel et al., 2013; C. Zhao et al., 2017). The results demonstrated the validity of spatial-spectral classification approaches. In addition, some approaches attempted to explore effective feature extraction algorithms that better reveal the structural information of HSI, which can help with subsequent classification tasks. Manifold learning is one widely used technique that can capture the nonlinear characteristics of data; it has been combined with other approaches to improve performance, such as sparse representation (Y. Y. Tang et al., 2014), the hypergraph framework (Duan et al., 2021), and geodesic distance (Duan et al., 2022).

In the past few years, with the continuous improvement of computer performance, deep learning has rapidly become a research hotspot (L. Zhang et al., 2016). The learning process of deep learning methods is fully automated through data-driven technology (Liu, Wu et al., 2022). Owing to its outstanding performance, deep learning has been used in object detection, action recognition, and natural language processing (Minar & Naher, 2018). Furthermore, deep learning has also been adopted by many researchers for HSI classification tasks. Compared with traditional approaches, deep learning can learn complex and abstract information from data by simulating the recognition process of the brain. Among the various deep network models, convolutional neural networks (CNNs) are widely used in HSI classification research (Makantasis et al., 2015; Yang et al., 2018). CNNs can capture contextual spatial information in an end-to-end and hierarchical manner. Additionally, CNNs adopt a weight-sharing mechanism, which greatly reduces the number of network parameters. Although the role of CNNs in HSI classification is undeniable, they are not good at capturing long-range dependencies in the data.

The advent of the vision Transformer (Dosovitskiy et al., 2020) has inspired researchers to reconsider the image classification process in terms of sequential data in order to obtain long-range dependencies. The Transformer was first developed by Vaswani et al. (2017) and applied to natural language processing. Numerous experiments have verified the powerful image processing capability of Transformers in the computer vision domain (Carion et al., 2020; Z. Liu et al., 2021; Touvron et al., 2021). In contrast to the Transformer, Tolstikhin et al. (2021) discarded the self-attention operation and proposed the MLP-Mixer (a.k.a. Mixer), an architecture based only on multilayer perceptrons (MLPs). It obtains a global receptive field through matrix transposition and token-mixing projections. Subsequently, Hou et al. (2022) proposed a new MLP-based method named Vision Permutator (ViP), which further splits the encoding of the spatial dimensions to produce more precise positional information. ViP therefore has three branches encoding the width, height and channel dimensions, respectively, and it directly sums the features of the three branches at the end. However, for HSI, the channel dimension also contains rich features, and aggregating the three branches in one step may result in insufficient fusion of spatial and spectral information.

To address the problems mentioned above, a novel spatial-spectral hierarchical Vision Permutator (S2HViP) network is proposed in this paper for HSI classification. First, given that HSI feature maps are usually represented as 3-D cubes, we design a hierarchical Vision Permutator (HViP). By utilizing a hierarchical fusion strategy, HViP can not only extract long-range dependencies in parallel from each of the three dimensions of HSI, but also improve the fusion of features across dimensions. Then, we design a spectral feature extraction module and a spatial feature extraction module to capture different information. In the spectral module, we group the raw data along the spectral dimension, and the pixels in each group serve as spectral tokens. The spectral features are obtained by computing intra- and inter-group spectral correlations. In the spatial module, morphological methods are introduced to help the subsequent network model deep spatial features. Finally, the features extracted by the different modules are fused to further improve the representation capacity of the deep features.

The key contributions of this article can be summarized as follows:

  1. We propose a novel S2HViP network for HSI classification. Compared with mainstream CNN-based models, S2HViP is good at capturing the global characteristics of the data. We design two different modules to extract spectral and spatial long-range dependencies, and these two kinds of global information are aggregated to obtain robust spatial-spectral features for classification.

  2. We design an improved ViP for HSI, named HViP. Specifically, it captures features along the width, height and channel dimensions of HSI, respectively. Based on this, HViP effectively aggregates features from the different dimensions through a hierarchical fusion strategy to obtain discriminative features.

Related works

In recent years, HSI classification research has continued to advance, with researchers trying a variety of methods to improve classification performance. We divide these methods into traditional, CNN-based, Transformer-based, and MLP-based approaches, and review their development in this section.

Traditional approaches

Early on, the spectral characteristics of HSI were the focus of most methods. However, high intra-class spectral variability and low inter-class spectral variability both pose challenges for pixel-wise approaches (Ghamisi et al., 2017; L. He et al., 2018). Considering that the 3-D structure of HSI contains a wealth of spectral and spatial features, the introduction of spatial information can effectively improve classification performance (Ghamisi et al., 2018). Morphological profiles (MPs) are generated by morphological transformations and are used to model spatial information (Pesaresi & Benediktsson, 2001). Building on this, the extended MP was proposed for the HSI classification task (Benediktsson et al., 2005). Random fields and probabilistic graphical models can incorporate spatial features into the classification stage. The Markov random field is one of the classic methods for HSI classification (J. Li et al., 2012); it can be combined with other techniques, such as SVM (Ghamisi et al., 2014), multinomial logistic regression (Khodadadzadeh et al., 2014), ensemble classifiers (Xia et al., 2015) and active learning (S. Sun et al., 2015). Besides, conditional random fields (J. Zhao et al., 2018), superpixel segmentation (Tu et al., 2020), collaborative representation (Liu et al., 2016) and subspace projection (Wang et al., 2016) are also effective ways to exploit spatial features.

CNN-based approaches

Benefiting from advances in parallel computing, deep learning has become a popular method for handling data. Compared with traditional approaches, deep learning-based classifiers use multiple nonlinear layers to hierarchically construct high-level features in an automated manner. Numerous experiments demonstrate its effectiveness in the field of HSI classification (Audebert et al., 2019; Zhu et al., 2017). Typical methods include the stacked autoencoder (Chen et al., 2014), deep belief network (T. Li et al., 2014), recurrent neural network (Mou et al., 2017), and CNN (Chen et al., 2016). However, most of these models use vector inputs, except for the CNN, which causes the spatial contextual relationships between pixels to be ignored. The excellent ability of CNNs to extract spatial information makes them popular in the HSI field (Guo et al., 2020; X. Li et al., 2019). Paoletti et al. (2019) presented a deep residual network using pyramidal bottleneck residual blocks. Song et al. (2018) introduced a deep feature fusion network (DFFN) that fuses the outputs of different hierarchical layers to obtain more robust features. In addition to the 2D-CNN-based approaches described above, using a 3-D kernel to extract spectral-spatial features is also a natural solution for HSI classification (Hamida et al., 2018; Zhong et al., 2018). Roy et al. (2020) combined the advantages of 2D-CNN and 3D-CNN to construct a hybrid spectral CNN (HybridSN). Attention mechanisms have given a further boost to network performance. Ma et al. (2019) formed a double-branch multi-attention mechanism network (DBMA) that captures spectral and spatial features with a different type of attention in each branch. Roy et al. (2021) designed a ResNet with spectral attention named the attention-based adaptive spectral-spatial kernel ResNet (A2S2K-ResNet). Qing and Liu (2021) presented a multi-scale residual network with an attention mechanism that extracts spatial-spectral features at different scales. To extract global spatial information, Zheng et al. (2020) proposed a patch-free global learning architecture, which uses the entire HSI as input to a fully convolutional network. Wu et al. (2021) adopted a similar learning architecture and introduced a composite kernel learning network, which extracts spatial-spectral generalized kernel features for accurate classification. Subsequently, a dual-channel convolution network was developed by Yu et al. (2022) to make full use of the global and multi-scale information of HSI.

Transformer-based approaches

ViT (Dosovitskiy et al., 2020) is the first work to construct a vision backbone based exclusively on the Transformer, and it demonstrates excellent global feature extraction capabilities. Given its powerful performance, the Transformer has unsurprisingly been migrated to HSI classification tasks. J. He et al. (2020) showed that language models can be applied to HSI classification and proposed HSI-BERT, a model consisting of numerous attention layers. Inspired by HSI-BERT, Z. Zhong et al. (2022) designed a spectral-spatial Transformer network (SSTN) that replaces convolution operations with attention modules. SSTN is built from several spatial attention modules and spectral correlation modules: the spatial attention module captures the interactions between pixels at all locations, while the spectral correlation module models the correlation between a compact set of spectral vectors and all locations. Hong et al. (2022) applied a densely sampled method to group the spectral dimensions, and a cross-layer Transformer encoder module was then employed to learn advanced features from group-wise adjacent bands. Liu, Yu et al. (2022) successfully replaced the traditional convolutional layer with the Transformer, designing a spectral-spatial HSI classification model named DSS-TRM. L. Sun et al. (2022) developed a spectral-spatial feature tokenization Transformer, which organically combines CNN and Transformer to capture spectral-spatial information and high-level semantic information.

MLP-based approaches

Recently, Tolstikhin et al. (2021) demonstrated that the self-attention layer of Transformers is not necessary and presented a concise alternative model, the MLP-Mixer. It is made up of two components: the token-mixing MLP and the channel-mixing MLP. The token-mixing MLP projects feature maps along the spatial dimension to obtain spatial features between different locations. The channel-mixing MLP acts independently on each channel of the feature map to capture the communication between different channels. Inspired by MLP-Mixer, Hou et al. (2022) proposed an effective MLP architecture, the Vision Permutator (ViP). The main difference from the MLP-Mixer is that ViP splits the spatial feature representation into height-encoding and width-encoding features and performs the linear projections independently. This allows ViP to capture long-range dependencies along one spatial direction while retaining location information along the other. MLP-based methods have also received attention in the hyperspectral field. He and Chen (2021) proposed a pure MLP architecture for the HSI classification task, demonstrating that MLP networks can provide promising classification performance. Meng et al. (2021) used the Mixer as the network backbone to extract spatial and spectral information alternately via matrix transposition and MLPs, allowing interaction between the different kinds of information. A model based on an MLP network and residual learning was presented by X. J. Tang et al. (2022), in which the MLP effectively removes translation invariance and local connectivity. Gong et al. (2022) proposed an MLP architecture that applies a ladder-like connected structure to obtain contextual interaction for HSI classification tasks. Moreover, Lin et al. (2022) proposed a multi-scale U-shape MLP, which consists of a designed multi-scale channel block and a U-shape MLP.

Proposed framework

In this section, we introduce the overall structure of S2HViP in detail. Our method contains two modules, the spatial module and the spectral module. Figure 1 shows the framework diagram of our method.

Figure 1. The overall architecture of the proposed S2HViP.


Spectral module

An initial HSI data cube $I \in \mathbb{R}^{H \times W \times B}$ is given, where H, W, and B represent the height, the width, and the number of spectral bands, respectively. We divide I into M non-overlapping groups along the spectral dimension using a 3D-CNN to obtain the grouped data $X_{group} \in \mathbb{R}^{H \times W \times M}$, where M is the number of groups, and consider each pixel in the groups as a spectral token. If B is not divisible by M, we pad with the first few bands of I. Then, $X_{group}$ is projected to the spectral token data $X_{spe} \in \mathbb{R}^{H \times W \times C}$ by a linear embedding layer, where C is the hidden layer dimension. This process constitutes the spectral patch embedding; implementation details can be found in Table 1. We feed $X_{spe}$ into the subsequent MLP-based layers. The flowchart of the hierarchical Permutator (HP) block is shown in Figure 2; the biggest difference between our method and the Transformer is that we abandon the self-attention operation. A basic HP block consists of LayerNorms, skip connections, a hierarchical Permute-MLP (a.k.a. Permute-MLP) and a Channel-MLP. We choose the Gaussian error linear unit (GELU; Hendrycks & Gimpel, 2016) as the activation function. The data handling process of the HP block can be expressed as:

(1) $Y_{spe} = \mathrm{PermuteMLP}(\mathrm{LN}(X_{spe})) + X_{spe}$

(2) $Z_{spe} = \mathrm{ChannelMLP}(\mathrm{LN}(Y_{spe})) + Y_{spe}$

Figure 2. Flowchart of hierarchical Permutator.


Table 1. Configuration of the spectral module for the Indian Pines dataset.

where LN denotes LayerNorm, and $Y_{spe}$ and $Z_{spe}$ represent the spectral features extracted at the two stages, respectively. $Z_{spe}$ serves as the input to the subsequent HP blocks.
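
To make the data flow concrete, the following PyTorch sketch implements one HP block following Eqs. (1) and (2). This is a minimal sketch, not the authors' released code: the token-mixing module is passed in as an argument (the hierarchical Permute-MLP itself is sketched after Eq. (3) below), and the Channel-MLP expansion ratio of 3 is an assumption borrowed from common ViP/Mixer practice rather than a value reported in this paper.

```python
import torch.nn as nn

class HPBlock(nn.Module):
    """One hierarchical Permutator (HP) block, Eqs. (1)-(2):
    LayerNorm -> token mixing -> skip, then LayerNorm -> Channel-MLP -> skip."""

    def __init__(self, dim, token_mixer, mlp_ratio=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = token_mixer        # hierarchical Permute-MLP (see Eq. (3))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(     # position-wise Channel-MLP with GELU
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                     # x: (batch, H, W, C) token map
        x = x + self.token_mixer(self.norm1(x))   # Eq. (1)
        x = x + self.channel_mlp(self.norm2(x))   # Eq. (2)
        return x
```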

The visual illustration of the hierarchical Permute-MLP is shown in Figure 3; it has three branches. Linear projections are used to model the input 3-D features separately along their respective dimensions. In the spectral module, we denote the height, width and channel features of the extracted feature maps by $X_H^{spe}$, $X_W^{spe}$, and $X_C^{spe}$, respectively. The height and width dimensions represent the spectral correlation within the same group, and the channel dimension stands for the spectral correlation between groups. We illustrate the information extraction process using the height branch as an example, which begins with a height-channel permutation of $X_{spe}$. First, the channel dimension is divided into T parts to yield $X_{H_i}^{spe} \in \mathbb{R}^{H \times W \times N}, i \in \{1, \ldots, T\}$, satisfying $C = N \times T$, where N is the number of spectra in each part and is numerically equal to H. Next, the first and third dimensions of each $X_{H_i}^{spe}$ are permuted to give $X_{H_i}^{spe} \in \mathbb{R}^{N \times W \times H}$. All $X_{H_i}^{spe}$ are then concatenated along the third dimension to output $X_H^{spe} \in \mathbb{R}^{N \times W \times C}$. $X_H^{spe}$ is encoded by a fully connected (FC) layer with weight $W_H^{spe} \in \mathbb{R}^{C \times C}$. After feature extraction, the original input dimensions are recovered by performing the inverse of the permutation described above, yielding the height features $X_H^{spe}$. In the second branch, we perform a similar operation to obtain $X_W^{spe}$. The third branch encodes features via an FC layer with weight $W_C^{spe} \in \mathbb{R}^{C \times C}$ to yield $X_C^{spe}$. At this point, three types of features $X_H^{spe}, X_W^{spe}, X_C^{spe}$ have been produced by the three branches. We use a hierarchical fusion approach to obtain more discriminative features. The intra-group spectral information is fused first to refine the correlated features between different pixels in the same group. The fused features are then combined with the inter-group correlations, obtained by mining the spectral features of different groups, to produce spectral long-range dependencies. To distinguish the importance of different branches, we adopt the split-attention proposed by H. Zhang et al. (2020) to assign weights during fusion. The recalibrated fusion results of the different branches are passed to the next FC layer. The fusion operation for spectral correlations can be calculated as follows:

(3) $\hat{X}_{spe} = \mathrm{FC}(\mathrm{FC}(X_H^{spe} + X_W^{spe}) + X_C^{spe})$

Figure 3. Basic structure of the hierarchical Permute-MLP layers.


where FC stands for a fully connected layer and $\hat{X}_{spe}$ denotes the output spectral features. After several HP blocks, the spectral information is sent to a global average pooling layer to obtain the spectral vector used for fusion and classification.
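
The permutation and hierarchical fusion described above can be sketched as follows. This is an illustrative reconstruction under two stated assumptions: the token map is square with H = W = N (consistent with the 9 × 9 patch configuration used later), and the split-attention re-weighting of H. Zhang et al. (2020) is replaced, for brevity, by the plain FC fusion of Eq. (3). The class and argument names are ours.

```python
import torch
import torch.nn as nn

class HierarchicalPermuteMLP(nn.Module):
    """Three-branch hierarchical Permute-MLP; input/output (batch, H, W, C)."""

    def __init__(self, dim, seg_T):
        super().__init__()
        self.T = seg_T                       # number of channel segments T
        self.mlp_h = nn.Linear(dim, dim)     # FC with weight W_H (C x C)
        self.mlp_w = nn.Linear(dim, dim)     # FC with weight W_W (C x C)
        self.mlp_c = nn.Linear(dim, dim)     # FC with weight W_C (C x C)
        self.fuse_hw = nn.Linear(dim, dim)   # level 1: fuse intra-group (H + W)
        self.fuse_all = nn.Linear(dim, dim)  # level 2: add inter-group (C)

    def forward(self, x):
        B, H, W, C = x.shape
        T, N = self.T, C // self.T           # C = N * T, with N == H == W assumed

        # Height branch: split C into T parts, swap each part with H,
        # project along the flattened (T * H) = C axis, then swap back.
        x_h = x.reshape(B, H, W, T, N).permute(0, 4, 2, 3, 1).reshape(B, N, W, T * H)
        x_h = self.mlp_h(x_h)
        x_h = x_h.reshape(B, N, W, T, H).permute(0, 4, 2, 3, 1).reshape(B, H, W, C)

        # Width branch: symmetric, swapping each channel part with W.
        x_w = x.reshape(B, H, W, T, N).permute(0, 1, 4, 3, 2).reshape(B, H, N, T * W)
        x_w = self.mlp_w(x_w)
        x_w = x_w.reshape(B, H, N, T, W).permute(0, 1, 4, 3, 2).reshape(B, H, W, C)

        # Channel branch: plain projection over C.
        x_c = self.mlp_c(x)

        # Hierarchical fusion, Eq. (3): H and W first, then C.
        return self.fuse_all(self.fuse_hw(x_h + x_w) + x_c)
```

In the spectral module this layer mixes intra-group (height/width) and inter-group (channel) spectral correlations; in the spatial module the same structure operates on the (h, w, C) token grid.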

Spatial module

In the spatial module, the extended multi-attribute profile (EMAP; Dalla Mura, Atli Benediktsson et al., 2010) is utilized to extract preliminary spatial features. Specifically, attribute profiles (APs; Dalla Mura, Benediktsson et al., 2010) are a multi-level decomposition of the input image based on attribute filters. In contrast to MPs, APs can process images according to a variety of flexibly defined attributes. An AP is obtained by:

(4) $AP(f) = \{\phi_n(f), \phi_{n-1}(f), \ldots, \phi_1(f), f, \gamma_1(f), \ldots, \gamma_{n-1}(f), \gamma_n(f)\}$

where f is the input image, and $\phi_j(f)$ and $\gamma_j(f)$ represent closing (thickening) and opening (thinning) operators, respectively. Opening and closing are a pair of opposite morphological operators; in their classical form they process images through a sliding window called the structuring element (SE), whose size affects the degree of processing. Then, we calculate the AP for each principal component (PC) after dimensionality reduction to obtain the extended APs (EAPs), which model the needed spatial features:

(5) $EAP = \{AP(PC_1), AP(PC_2), \ldots, AP(PC_m)\}$

where m is the number of principal components retained after PCA. An EMAP concatenates different EAPs, each computed with a different attribute, with the aim of extracting different information from the scene:

(6) $EMAP = \{EAP_{a_1}, EAP_{a_2}, \ldots, EAP_{a_k}\}$

where $a_1, a_2, \ldots, a_k$ denote the k different attributes.
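
As an illustration of Eqs. (4)-(6), the sketch below builds an area-attribute profile with scikit-learn's PCA and scikit-image's area_opening/area_closing, which stand in for the attribute thinning and thickening operators. The moment-of-inertia attribute used in our experiments requires a max-tree attribute filter and is omitted here; the function names and default thresholds are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from skimage.morphology import area_opening, area_closing

def area_attribute_profile(pc, thresholds):
    """Area-attribute AP of one principal component, mirroring Eq. (4):
    thickenings (closings) with decreasing thresholds, the original
    image, then thinnings (openings) with increasing thresholds."""
    closings = [area_closing(pc, area_threshold=t)
                for t in sorted(thresholds, reverse=True)]
    openings = [area_opening(pc, area_threshold=t)
                for t in sorted(thresholds)]
    return np.stack(closings + [pc] + openings, axis=-1)

def eap_features(hsi, n_pcs=4, thresholds=(100, 500, 900)):
    """EAP over the first n_pcs principal components, Eq. (5);
    a full EMAP (Eq. (6)) would concatenate EAPs for several attributes."""
    H, W, B = hsi.shape
    pcs = PCA(n_components=n_pcs).fit_transform(hsi.reshape(-1, B))
    pcs = pcs.reshape(H, W, n_pcs)
    eaps = [area_attribute_profile(pcs[..., i], thresholds)
            for i in range(n_pcs)]
    return np.concatenate(eaps, axis=-1)   # (H, W, G) preliminary spatial features
```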

Mathematically, the input to the spatial module is also $I \in \mathbb{R}^{H \times W \times B}$. PCA is first used to process the data to obtain the primary spatial information of the m PCs. After the EMAP procedure, we have preliminary spatial features $I_{emap} \in \mathbb{R}^{H \times W \times G}$, where G is the number of spectral channels. The next step is to divide the obtained features evenly into several tokens and convert the channel dimension to the hidden layer dimension. In this module, we choose a 2D-CNN to accomplish the spatial token embedding. The convolutional layer is composed of C 2-D kernels; the spatial size and stride of each kernel are both p. Then, C-dimensional spatial token data $X_{spa} \in \mathbb{R}^{h \times w \times C}$ are generated, where h and w denote the height and the width of the token grid, satisfying $h = H/p$ and $w = W/p$. The implementation details of the embedding operation are available in Table 2. These spatial tokens are sent into a series of HP blocks to extract deeper robust spatial information. Instead of flattening the spatial dimensions, we input the 3-D data $X_{spa}$ directly and encode the width and height of the tokens separately. This enables HViP to capture long-range dependencies in one spatial dimension while preserving position information in the other. In the Permute-MLP layer, we first perform a weighted fusion of the location information of the height branch $X_H^{spa}$ and the width branch $X_W^{spa}$ to produce spatial feature representations. Following the hierarchical fusion strategy, after fusing the features of $X_H^{spa}$ and $X_W^{spa}$, the information from the channel branch $X_C^{spa}$ is utilized as a complement to generate the global spatial features. The corresponding calculation process is given as:

(7) $\hat{X}_{spa} = \mathrm{FC}(\mathrm{FC}(X_H^{spa} + X_W^{spa}) + X_C^{spa})$

Table 2. Configuration of the spatial module for the Indian Pines dataset.

where $\hat{X}_{spa}$ denotes the output spatial features. The deep spatial long-range dependencies are obtained after several HP blocks. At the end of the spatial module, the final spatial vector is obtained by passing the features through a global average pooling layer; this vector is likewise used for fusion and classification.
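
The spatial token embedding amounts to a single strided convolution; a minimal sketch with our own module and argument names is given below.

```python
import torch.nn as nn

class SpatialTokenEmbedding(nn.Module):
    """Divide the EMAP feature map into p x p patches and embed each
    patch into a C-dimensional token via one strided Conv2d."""

    def __init__(self, emap_channels_G, hidden_C, patch_p):
        super().__init__()
        self.proj = nn.Conv2d(emap_channels_G, hidden_C,
                              kernel_size=patch_p, stride=patch_p)

    def forward(self, x):                    # x: (batch, G, H, W)
        tokens = self.proj(x)                # (batch, C, H/p, W/p)
        return tokens.permute(0, 2, 3, 1)    # (batch, h, w, C) for the HP blocks
```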

Weight fusion

Considering that the extracted spectral and spatial features are generated by separate modules, we utilize a weighted fusion approach to obtain spectral-spatial joint features. Specifically, different weighted scores are assigned to the spectral features $F_{spe}$ and the spatial features $F_{spa}$, and the two are fused by summation. The fusion process can be formulated as:

(8) $F = \lambda \times F_{spe} + (1 - \lambda) \times F_{spa}$

where F denotes the fused features and $\lambda$ is a weighting parameter in the range [0, 1]. We send F into an FC layer to obtain the final feature vector. Note that the length of the vector equals the number of classes, and the vector is treated as the class-specific response.
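
A minimal sketch of the weighted fusion and classification head of Eq. (8) follows; treating λ as a fixed hyperparameter is our assumption, since the text only specifies its range.

```python
import torch.nn as nn

class WeightedFusionHead(nn.Module):
    """Fuse the pooled spectral and spatial vectors (Eq. (8)) and classify."""

    def __init__(self, feat_dim, num_classes, lam=0.5):
        super().__init__()
        self.lam = lam                       # weighting parameter in [0, 1]
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, f_spe, f_spa):         # both: (batch, feat_dim)
        fused = self.lam * f_spe + (1.0 - self.lam) * f_spa   # Eq. (8)
        return self.fc(fused)                # class-specific responses
```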

Experiments

Hyperspectral datasets

In this paper, the effectiveness of the proposed method was validated on three benchmark hyperspectral datasets. The detailed data description is presented below.

  1. The Indian Pines (IP) dataset was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in Northwestern Indiana, USA. It contains 145 × 145 pixels with a spatial resolution of 20 m/pixel and 224 spectral bands in the wavelength range from 400 to 2500 nm. We utilize 200 bands after removing four bands containing zero values and 20 noisy bands. The ground truth includes 16 land-cover classes and 10,249 labelled pixels. Figure 4 shows the false-colour composite image and corresponding ground reference of IP. Table 3 outlines the number of labelled samples and classes of IP.

    Figure 4. False-colour composite image and ground-truth map of Indian Pines.


    Table 3. Land cover classes and the numbers of samples in the Indian Pines dataset.

  2. The Salinas Valley (SV) dataset was recorded by the AVIRIS sensor over Salinas Valley, California, USA. The available data comprise 204 bands of 512 × 217 pixels with a spatial resolution of 3.7 m, after the low signal-to-noise ratio (SNR) bands were removed. The ground truth includes 16 land-cover classes and 54,129 labelled pixels. Figure 5 shows the false-colour composite image and corresponding ground reference of SV, and the details of the labelled samples and classes are presented in Table 4.

    Figure 5. False-colour composite image and ground-truth map of Salinas Valley.


    Table 4. Land cover classes and the numbers of samples in the Salinas Valley dataset.

  3. The University of Pavia (UP) dataset was acquired by the ROSIS sensor during a flight campaign over Pavia, Northern Italy. It consists of 610 × 340 pixels with a spatial resolution of 1.3 m and has 115 spectral channels in the wavelength range from 0.43 to 0.86 µm. We utilize 103 bands after removing 12 noisy bands. The ground truth includes 9 land-cover classes and 42,776 labelled pixels. Figure 6 shows the false-colour composite image and corresponding ground reference of UP. The numbers of labelled samples and classes of UP are listed in Table 5.

    Figure 6. False-colour composite image and ground-truth map of University of Pavia.


    Table 5. Land cover classes and the numbers of samples in the University of Pavia dataset.

Experimental configuration

To verify the effectiveness of the proposed method, we compare it with several recently proposed HSI classification methods, including DFFN (Song et al., 2018), SSRN (Zhong et al., 2018), DBMA (Ma et al., 2019), pResNet (Paoletti et al., 2019), HybridSN (Roy et al., 2020), A2S2K-ResNet (Roy et al., 2021), FreeNet (Zheng et al., 2020), DPSCN (Dang et al., 2021), SSTN (Z. Zhong et al., 2022), and Mixer (Tolstikhin et al., 2021). For fairness, we adopt unified measurements to evaluate our framework: overall accuracy (OA), average accuracy (AA) and the Kappa coefficient (Kappa).

Details of the experimental configuration are as follows. For the IP dataset, 1% of labelled samples are randomly chosen from each class to form the training set, and the rest are used as the testing set. For the UP and SV datasets, 0.5% of labelled samples are selected for the training set and the remaining 99.5% for the testing set. Stochastic gradient descent (SGD) is utilized to train the parameters of the whole architecture. The batch size is 50 and the learning rate is 0.001. We set the size of the input patch to 9 × 9, and the network is trained for 200 epochs to obtain the final result. In the spatial module, two kinds of attributes are selected to calculate the EMAPs: the moment of inertia and area. For the moment of inertia, all three datasets use the same values {0.2, 0.3, 0.4, 0.5}. The area attribute has a total of 14 values, which differ between datasets. For the IP dataset, the initial area value is set to 100 with a step size of 400. For the SV dataset, the area values range from 270 to 7300 in steps of 540. For the UP dataset, the area values range from 770 to 10,769 with a step size of 769.
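
For reference, a minimal training loop matching this configuration might look as follows; model, train_loader, and device are placeholders, and plain SGD without momentum is an assumption since no momentum or weight decay is reported.

```python
import torch

# Assumed setup: `model` is the S2HViP network, `train_loader` yields
# (patch, label) batches of size 50 with 9 x 9 spatial patches.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(200):                     # 200 epochs per run
    for patches, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(patches.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()
```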

Results analysis

For the aforementioned classification approaches, Tables 6-8 show the OAs, AAs, and Kappas on the IP, SV, and UP datasets, respectively. As can be seen, the proposed method performs better than the other methods on all three datasets. In the following, we further analyse the experimental results on the three datasets.

Table 6. Classification results for the Indian Pines dataset using 1% training samples.

Table 7. Classification results for the Salinas Valley dataset using 0.5% training samples.

Table 8. Classification results for the University of Pavia dataset using 0.5% training samples.

As shown in Table 6, the improvements in terms of OA compared to DFFN, SSRN, DBMA, pResNet, HybridSN, FreeNet, A2S2K-ResNet, DPSCN, SSTN, and Mixer are 21.19%, 8.43%, 8.31%, 22.91%, 20.55%, 9.64%, 8.31%, 4.84%, 4.18%, and 5.69%, respectively. For the IP dataset, the diversity of land cover and the imbalance between sample sizes pose a challenge, especially the large quantitative gap between classes 7 and 9 and the other classes. Even so, compared with the previous approaches, the proposed S2HViP obtains the best AA and achieves satisfactory accuracy on the classes with few samples. Compared with the second-best method, DPSCN, the AA improvement is 4.69%. This is probably due to our method's strong ability to capture long-range spatial interactions, resulting in a lower misclassification rate when distinguishing between different land-cover classes. The classification maps for the IP dataset are displayed in Figure 7. It is noteworthy that our delineated boundaries are sharper and straighter in the boundary regions between different classes, which reflects that S2HViP can better extract global distribution features. This helps to generate a more reasonable regional division.

Figure 7. Classification maps resulting from different methods for the Indian Pines dataset. (a) Ground truth. (b) DFFN (OA:65.39%). (c) SSRN (OA:78.15%). (d) DBMA (OA:78.27%). (e) pResNet (OA:63.67%). (f) HybridSN (OA:66.03%). (g) FreeNet (OA:76.94%). (h) A2S2K-ResNet (OA:78.27%). (i) DPSCN (OA:81.74%). (j) SSTN (OA:82.40%). (k) Mixer (OA:80.89%). (l) Proposed (OA:86.58%).


Table 7 shows the classification results for the SV dataset. A characteristic of this dataset is that similar objects are clustered within the same area; the SV dataset therefore has very strong spatial similarity and rich texture features. Since FreeNet uses the whole HSI as input, spatial features are well preserved, and it achieves a good OA of 97.98%. However, the proposed S2HViP obtains a better OA of 98.85%. This can be attributed to the long-range dependencies extracted by S2HViP, which may be important for classification tasks over large-scale regions. Compared with the MLP-based method Mixer, our approach shows improvements of 0.66%, 1.04%, and 0.98% in terms of OA, AA, and Kappa. One reason is that S2HViP further splits the spatial information into width and height dimensions to encode the information separately, thus obtaining finer location information. Another reason is that the hierarchical fusion strategy facilitates better interaction of the information across the three dimensions of the feature map. The classification maps for this dataset are shown in Figure 8. The maps again demonstrate the strong classification capability of S2HViP. Our method has fewer misclassified points in comparison with other methods, which means our classification map is closer to the ground truth of the SV dataset.

Figure 8. Classification maps resulting from different methods for the Salinas Valley dataset. (a) Ground truth. (b) DFFN (OA:90.53%). (c) SSRN (OA:95.31%). (d) DBMA (OA:95.68%). (e) pResNet (OA:89.61%). (f) HybridSN (OA:96.64%). (g) FreeNet (OA:97.98%). (h) A2S2K-ResNet (OA:94.68%). (i) DPSCN (OA:94.92%). (j) SSTN (OA:93.14%). (k) Mixer (OA:97.96%). (l) Proposed (OA:98.85%).


Focusing on the UP dataset, Table 8 shows the classification results. As shown in Table 8, the proposed approach achieves the best OA, AA, and Kappa values of 98.40%, 96.89%, and 97.87%, respectively, while the SSTN network achieves 96.35%, 94.27%, and 95.13%. The improvements in OA, AA and Kappa are thus 2.05%, 2.62%, and 2.74%, respectively. Compared with Transformer-based methods, S2HViP abandons the self-attention operation; nevertheless, the results strongly prove the effectiveness of the MLP-based structure. According to Figure 9, compared with the classification maps of other methods, our method has fewer pixels of other categories appearing within a given category region. On the one hand, our method can efficiently extract long-range dependencies through global receptive fields, reducing the likelihood that pixels within connected regions are mistaken for other highly similar objects. On the other hand, the introduction of morphological approaches complements the local spatial features, which helps the subsequent information extraction process. As a result, S2HViP obtains competitive results in comparison with other methods.

Figure 9. Classification maps resulting from different methods for the University of Pavia dataset. (a) Ground truth. (b) DFFN (OA:88.65%). (c) SSRN (OA:96.11%). (d) DBMA (OA:95.97%). (e) pResNet (OA:86.70%). (f) HybridSN (OA:84.81%). (g) FreeNet (OA:94.27%). (h) A2S2K-ResNet (OA:95.05%). (i) DPSCN (OA:97.19%). (j) SSTN (OA:96.35%). (k) Mixer (OA:97.74%). (l) Proposed (OA:98.40%).


Investigation of different numbers of training samples

To further verify the effectiveness of S2HViP, we investigate the performance of the different approaches with different training sample sizes. For the IP and SV datasets, 0.1%, 0.3%, 0.5%, 1% and 3% of samples are randomly selected. For the UP dataset, we randomly choose 0.3%, 0.5%, 1%, 3% and 5% of samples as training sets. The results for the different datasets are displayed in Figure 10. Overall, the proposed S2HViP shows better performance. As the sample size increases, the performance of all methods tends to improve and the differences between them become smaller. Nevertheless, when samples are insufficient, our method still outperforms the other methods. The results demonstrate the robustness and efficiency of S2HViP, which can be attributed to its excellent ability to extract long-range dependencies. Even with limited training samples, our method is still able to capture satisfactory global distribution information, resulting in better classification results.

Figure 10. The comparison results using different training sample ratios on (a) Indian Pines, (b) Salinas Valley, (c) University of Pavia.


Investigation of different sizes of patch

To explore the influence of different patch sizes on the experimental results, in addition to 9 × 9 we conducted experiments with patch sizes of 3 × 3, 6 × 6 and 12 × 12; the results are shown in Table 9. For the IP and UP datasets, OA first increases and then decreases as the patch size grows. In addition, S2HViP makes good use of the spatial aggregation properties of the SV dataset: OA continues to improve as the patch size increases, indicating that the SV dataset favours large patch sizes. Considering the running time of the model, we finally choose 9 × 9 as the consistent patch size for the experiments.

Table 9. Overall accuracy (OA) on three datasets with different patch sizes.

Ablation study

The proposed S2HViP consists of a spatial module and a spectral module. The former aims to obtain robust spatial features, while the latter is dedicated to extracting information from the rich spectral features. To investigate their respective contributions, we split the network into the two individual modules and run separate experiments, i.e. we use either the spatial module or the spectral module alone as the feature extraction framework. For the spatial module, we further explore the significance of the EMAP step. As can be gathered from Table 10, without the initial EMAP modelling of the image, the spectral module outperforms the spatial module overall. After modelling the spatial information with EMAP, the resulting spatial features are more easily extracted by the subsequent blocks; the sliding-window mechanism of the SE also helps to compensate for the local spatial information of the feature maps. These factors lead to a significant improvement in the performance of the spatial module. In addition, the global spatial-spectral information obtained by fusing the different features gives the best results when both modules are used.

Table 10. The ablation analysis of different modules (OA%) for three datasets.

To further validate the effectiveness of the hierarchical fusion strategy, we compare the performance gap between HViP and the direct weighted fusion approach, i.e. ViP. The results are listed in Table 11. On the three datasets, HViP achieves improvements of 1.78% (IP), 0.17% (SV), and 0.45% (UP) in terms of OA compared with ViP. The experimental results demonstrate the effectiveness of the hierarchical fusion strategy, which helps the three dimensions of the feature map fuse better and generate more discriminative global features.

Table 11. The ablation analysis of hierarchical integration strategy (OA%) for three datasets.

Conclusion

In this article, we present a novel MLP-based deep classification method. Considering that HSI is presented as 3-D data, we encode the three dimensions individually and combine them through a hierarchical fusion strategy. The strategy is demonstrated to be effective in improving the ability of feature extraction. The proposed S2HViP contains two feature extraction modules for learning the spectral and spatial information of HSI respectively. In the spectral module, we first capture intra- and inter-group spectral correlations by grouping. Then, we aggregate different correlations to refine long-range spectral dependencies. In the spatial module, we introduce morphological approaches to better model spatial features. Based on this, deep spatial information is further captured through MLPs. Finally, the information interaction between the spectral and spatial domains is achieved by weighted fusion, which effectively enhances the classification results. Experiments on three benchmark HSI datasets demonstrate the satisfactory performance of S2HViP and show the potential of the MLP-based network for HSI classification.

In the future, we will consider applying more innovative morphological methods to improve the feature extraction capability of the network. Moreover, the proposed method still has room for improvement in terms of time consumption, so we will also explore a lightweight MLP-based model for classification.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data that support the findings of this study are available at http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes.

Additional information

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant No. 62071204 and by the Natural Science Foundation of Jiangsu Province under Grant No. BK20201338.

References

  • Audebert, N., Le Saux, B., & Lefèvre, S. (2019). Deep learning for classification of hyperspectral data: A comparative review. IEEE Geoscience and Remote Sensing Magazine, 7(2), 159–16. https://doi.org/10.1109/MGRS.2019.2912563
  • Bandos, T. V., Bruzzone, L., & Camps-Valls, G. (2009). Classification of hyperspectral images with regularized linear discriminant analysis. IEEE Transactions on Geoscience and Remote Sensing, 47(3), 862–873. https://doi.org/10.1109/TGRS.2008.2005729
  • Benediktsson, J. A., Palmason, J. A., & Sveinsson, J. R. (2005). Classification of hyperspectral data from urban areas based on extended morphological profiles. IEEE Transactions on Geoscience and Remote Sensing, 43(3), 480–491. https://doi.org/10.1109/TGRS.2004.842478
  • Bioucas-Dias, J. M., Plaza, A., Camps-Valls, G., Scheunders, P., Nasrabadi, N., & Chanussot, J. (2013). Hyperspectral remote sensing data analysis and future challenges. IEEE Geoscience and Remote Sensing Magazine, 1(2), 6–36. https://doi.org/10.1109/MGRS.2013.2244672
  • Camps-Valls, G., Gomez-Chova, L., Muñoz-Marí, J., Vila-Francés, J., & Calpe-Maravilla, J. (2006). Composite kernels for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters, 3(1), 93–97. https://doi.org/10.1109/LGRS.2005.857031
  • Camps-Valls, G., Tuia, D., Bruzzone, L., & Benediktsson, J. A. (2013). Advances in hyperspectral image classification: Earth monitoring with statistical learning methods. IEEE Signal Processing Magazine, 31(1), 45–54. https://doi.org/10.1109/MSP.2013.2279179
  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European conference on computer vision, 213–229. Springer.
  • Carrino, T. A., Crósta, A. P., Toledo, C. L. B., & Silva, A. M. (2018). Hyperspectral remote sensing applied to mineral exploration in southern Peru: A multiple data integration approach in the Chapi Chiara gold prospect. International Journal of Applied Earth Observation and Geoinformation, 64, 287–300. https://doi.org/10.1016/j.jag.2017.05.004
  • Chang, C. I. (Ed.). (2007). Hyperspectral data exploitation: Theory and applications. John Wiley & Sons.
  • Chen, Y., Jiang, H., Li, C., Jia, X., & Ghamisi, P. (2016). Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 54(10), 6232–6251. https://doi.org/10.1109/TGRS.2016.2584107
  • Chen, Y., Lin, Z., Zhao, X., Wang, G., & Gu, Y. (2014). Deep learning-based classification of hyperspectral data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(6), 2094–2107. https://doi.org/10.1109/JSTARS.2014.2329330
  • Dalla Mura, M., Atli Benediktsson, J., Waske, B., & Bruzzone, L. (2010). Extended profiles with morphological attribute filters for the analysis of hyperspectral data. International Journal of Remote Sensing, 31(22), 5975–5991. https://doi.org/10.1080/01431161.2010.512425
  • Dalla Mura, M., Benediktsson, J. A., Waske, B., & Bruzzone, L. (2010). Morphological attribute profiles for the analysis of very high resolution images. IEEE Transactions on Geoscience and Remote Sensing, 48(10), 3747–3762. https://doi.org/10.1109/TGRS.2010.2048116
  • Dang, L., Pang, P., Zuo, X., Liu, Y., & Lee, J. (2021). A dual-path small convolution network for hyperspectral image classification. Remote Sensing, 13(17), 3411. https://doi.org/10.3390/rs13173411
  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  • Duan, Y., Huang, H., & Tang, Y. (2021). Local constraint-based sparse manifold hypergraph learning for dimensionality reduction of hyperspectral image. IEEE Transactions on Geoscience and Remote Sensing, 59(1), 613–628. https://doi.org/10.1109/TGRS.2020.2995709
  • Duan, Y., Huang, H., & Wang, T. (2022). Semisupervised feature extraction of hyperspectral image using nonlinear geodesic sparse hypergraphs. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–15. https://doi.org/10.1109/TGRS.2021.3110855
  • Fauvel, M., Tarabalka, Y., Benediktsson, J. A., Chanussot, J., & Tilton, J. C. (2013). Advances in spectral-spatial classification of hyperspectral images. Proceedings of the IEEE, 101(3), 652–675.
  • Ghamisi, P., Benediktsson, J. A., & Ulfarsson, M. O. (2014). Spectral–spatial classification of hyperspectral images based on hidden Markov random fields. IEEE Transactions on Geoscience and Remote Sensing, 52(5), 2565–2574. https://doi.org/10.1109/TGRS.2013.2263282
  • Ghamisi, P., Maggiori, E., Li, S., Souza, R., Tarablaka, Y., Moser, G., De Giorgi, A., Fang, L., Chen, Y., Chi, M., Serpico, S. B., & Benediktsson, J. A. (2018). New frontiers in spectral-spatial hyperspectral image classification: The latest advances based on mathematical morphology, Markov random fields, segmentation, sparse representation, and deep learning. IEEE Geoscience and Remote Sensing Magazine, 6(3), 10–43. https://doi.org/10.1109/MGRS.2018.2854840
  • Ghamisi, P., Yokoya, N., Li, J., Liao, W., Liu, S., Plaza, J., Plaza, J., Rasti, B., & Plaza, A. (2017). Advances in hyperspectral image and signal processing: A comprehensive overview of the state of the art. IEEE Geoscience and Remote Sensing Magazine, 5(4), 37–78. https://doi.org/10.1109/MGRS.2017.2762087
  • Gong, N., Zhang, C., Zhou, H., Zhang, K., Wu, Z., & Zhang, X. (2022). Classification of hyperspectral images via improved cycle‐mlp. IET Computer Vision.
  • Guo, H., Liu, J., Yang, J., Xiao, Z., & Wu, Z. (2020). Deep collaborative attention network for hyperspectral image classification by combining 2-D CNN and 3-D CNN. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13, 4789–4802. https://doi.org/10.1109/JSTARS.2020.3016739
  • Ham, J., Chen, Y., Crawford, M. M., & Ghosh, J. (2005). Investigation of the random forest framework for classification of hyperspectral data. IEEE Transactions on Geoscience and Remote Sensing, 43(3), 492–501. https://doi.org/10.1109/TGRS.2004.842481
  • Hamida, A. B., Benoit, A., Lambert, P., & Amar, C. B. (2018). 3-D deep learning approach for remote sensing image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(8), 4420–4434. https://doi.org/10.1109/TGRS.2018.2818945
  • He, X., & Chen, Y. (2021). Modifications of the multi-layer perceptron for hyperspectral image classification. Remote Sensing, 13(17), 3547. https://doi.org/10.3390/rs13173547
  • He, L., Li, J., Liu, C., & Li, S. (2018). Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines. IEEE Transactions on Geoscience and Remote Sensing, 56(3), 1579–1597. https://doi.org/10.1109/TGRS.2017.2765364
  • Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
  • He, J., Zhao, L., Yang, H., Zhang, M., & Li, W. (2020). HSI-BERT: Hyperspectral image classification using the bidirectional encoder representation from transformers. IEEE Transactions on Geoscience and Remote Sensing, 58(1), 165–178. https://doi.org/10.1109/TGRS.2019.2934760
  • Hong, D., Han, Z., Yao, J., Gao, L., Zhang, B., Plaza, A., & Chanussot, J. (2022). SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–15. https://doi.org/10.1109/TGRS.2022.3172371
  • Hou, Q., Jiang, Z., Yuan, L., Cheng, M. M., Yan, S., & Feng, J. (2022). Vision permutator: A permutable mlp-like architecture for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Imani, M., & Ghassemian, H. (2020). An overview on spectral and spatial information fusion for hyperspectral image classification: Current trends and challenges. Information Fusion, 59, 59–83. https://doi.org/10.1016/j.inffus.2020.01.007
  • Khodadadzadeh, M., Li, J., Plaza, A., Ghassemian, H., Bioucas-Dias, J. M., & Li, X. (2014). Spectral–spatial classification of hyperspectral data using local and global probabilities for mixed pixel characterization. IEEE Transactions on Geoscience and Remote Sensing, 52(10), 6298–6314. https://doi.org/10.1109/TGRS.2013.2296031
  • Li, J., Bioucas-Dias, J. M., & Plaza, A. (2012). Spectral–spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields. IEEE Transactions on Geoscience and Remote Sensing, 50(3), 809–823. https://doi.org/10.1109/TGRS.2011.2162649
  • Li, X., Ding, M., & Pižurica, A. (2019). Deep feature fusion via two-stream convolutional neural network for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 58(4), 2615–2629. https://doi.org/10.1109/TGRS.2019.2952758
  • Lin, M., Jing, W., Di, D., Chen, G., & Song, H. (2022). Multi-scale U-shape MLP for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters. https://doi.org/10.1109/LGRS.2022.3141547
  • Li, S., Song, W., Fang, L., Chen, Y., Ghamisi, P., & Benediktsson, J. A. (2019). Deep learning for hyperspectral image classification: An overview. IEEE Transactions on Geoscience and Remote Sensing, 57(9), 6690–6709. https://doi.org/10.1109/TGRS.2019.2907932
  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., … Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).
  • Liu, J., Wu, Z., Li, J., Plaza, A., & Yuan, Y. (2016). Probabilistic-kernel collaborative representation for spatial–spectral hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 54(4), 2371–2384. https://doi.org/10.1109/TGRS.2015.2500680
  • Liu, J., Wu, Z., Xiao, L., Sun, J., & Yan, H. (2020). Generalized tensor regression for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 58(2), 1244–1258. https://doi.org/10.1109/TGRS.2019.2944989
  • Liu, J., Wu, Z., Xiao, L., & Wu, X. J. (2022). Model inspired autoencoder for unsupervised hyperspectral image super-resolution. IEEE Transactions on Geoscience and Remote Sensing.
  • Liu, B., Yu, A., Gao, K., Tan, X., Sun, Y., & Yu, X. (2022). DSS-TRM: Deep spatial–spectral transformer for hyperspectral image classification. European Journal of Remote Sensing, 55(1), 103–114. https://doi.org/10.1080/22797254.2021.2023910
  • Li, T., Zhang, J., & Zhang, Y. (2014). Classification of hyperspectral image based on deep belief networks. In 2014 IEEE international conference on image processing (ICIP) (pp. 5132–5136). IEEE.
  • Makantasis, K., Karantzalos, K., Doulamis, A., & Doulamis, N. (2015, July). Deep supervised learning for hyperspectral data classification through convolutional neural networks. In 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) (pp. 4959–4962). IEEE.
  • Ma, W., Yang, Q., Wu, Y., Zhao, W., & Zhang, X. (2019). Double-branch multi-attention mechanism network for hyperspectral image classification. Remote Sensing, 11(11), 1307. https://doi.org/10.3390/rs11111307
  • Melgani, F., & Bruzzone, L. (2004). Classification of hyperspectral remote sensing images with support vector machines. IEEE Transactions on Geoscience and Remote Sensing, 42(8), 1778–1790. https://doi.org/10.1109/TGRS.2004.831865
  • Meng, Z., Zhao, F., & Liang, M. (2021). SS-MLP: A novel spectral-spatial MLP architecture for hyperspectral image classification. Remote Sensing, 13(20), 4060. https://doi.org/10.3390/rs13204060
  • Minar, M. R., & Naher, J. (2018). Recent advances in deep learning: An overview. arXiv preprint arXiv:1807.08169.
  • Mou, L., Ghamisi, P., & Zhu, X. X. (2017). Deep recurrent neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 55(7), 3639–3655. https://doi.org/10.1109/TGRS.2016.2636241
  • Pal, M. (2012). Multinomial logistic regression-based feature selection for hyperspectral data. International Journal of Applied Earth Observation and Geoinformation, 14(1), 214–220. https://doi.org/10.1016/j.jag.2011.09.014
  • Paoletti, M. E., Haut, J. M., Fernandez-Beltran, R., Plaza, J., Plaza, A. J., & Pla, F. (2019). Deep pyramidal residual networks for spectral–spatial hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 57(2), 740–754. https://doi.org/10.1109/TGRS.2018.2860125
  • Pesaresi, M., & Benediktsson, J. A. (2001). A new approach for the morphological segmentation of high-resolution satellite imagery. IEEE Transactions on Geoscience and Remote Sensing, 39(2), 309–320. https://doi.org/10.1109/36.905239
  • Prasad, S., & Bruce, L. M. (2008). Limitations of principal components analysis for hyperspectral target recognition. IEEE Geoscience and Remote Sensing Letters, 5(4), 625–629. https://doi.org/10.1109/LGRS.2008.2001282
  • Qing, Y., & Liu, W. (2021). Hyperspectral image classification based on multi-scale residual network with attention mechanism. Remote Sensing, 13(3), 335. https://doi.org/10.3390/rs13030335
  • Roy, S. K., Krishna, G., Dubey, S. R., & Chaudhuri, B. B. (2020). HybridSN: Exploring 3D-2D CNN feature hierarchy for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters, 17(2), 277–281. https://doi.org/10.1109/LGRS.2019.2918719
  • Roy, S. K., Manna, S., Song, T., & Bruzzone, L. (2021). Attention-based adaptive spectral–spatial kernel ResNet for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 59(9), 7831–7843. https://doi.org/10.1109/TGRS.2020.3043267
  • Shen, D., Liu, J., Wu, Z., Yang, J., & Xiao, L. (2022). ADMM-HFNet: A matrix decomposition-based deep approach for hyperspectral image fusion. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–17. https://doi.org/10.1109/TGRS.2021.3112181
  • Song, W., Li, S., Fang, L., & Lu, T. (2018). Hyperspectral image classification with deep feature fusion network. IEEE Transactions on Geoscience and Remote Sensing, 56(6), 3173–3184. https://doi.org/10.1109/TGRS.2018.2794326
  • Sun, L., Zhao, G., Zheng, Y., & Wu, Z. (2022). Spectral-spatial feature tokenization transformer for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing.
  • Sun, S., Zhong, P., Xiao, H., & Wang, R. (2015). An MRF model-based active learning framework for the spectral-spatial classification of hyperspectral imagery. IEEE Journal of Selected Topics in Signal Processing, 9(6), 1074–1088. https://doi.org/10.1109/JSTSP.2015.2414401
  • Tang, X. J., Liu, X., Yan, P. F., Li, B. X., Qi, H. Y., & Huang, F. (2022). An MLP Network Based on Residual Learning for Rice Hyperspectral Data Classification. IEEE Geoscience and Remote Sensing Letters, 19, 1–5. https://doi.org/10.1109/LGRS.2022.3149185
  • Tang, Y. Y., Yuan, H., & Li, L. (2014). Manifold-based sparse representation for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 52(12), 7606–7618. https://doi.org/10.1109/TGRS.2014.2315209
  • Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., & Dosovitskiy, A. (2021). Mlp-mixer: An all-mlp architecture for vision. Advances in Neural Information Processing Systems, 34. https://proceedings.neurips.cc/paper/2021/file/cba0a4ee5ccd02fda0fe3f9a3e7b89fe-Paper.pdf
  • Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357. PMLR.
  • Tu, T. (2000). Unsupervised signature extraction and separation in hyperspectral images: A noise-adjusted fast independent component analysis. Optical Engineering, 39(4), 897–906. https://doi.org/10.1117/1.602461
  • Tu, B., Zhou, C., He, D., Huang, S., & Plaza, A. (2020). Hyperspectral classification with noisy label detection via superpixel-to-pixel weighting distance. IEEE Transactions on Geoscience and Remote Sensing, 58(6), 4116–4131. https://doi.org/10.1109/TGRS.2019.2961141
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  • Wang, Q., Lin, J., & Yuan, Y. (2016). Salient band selection for hyperspectral image classification via manifold ranking. IEEE Transactions on Neural Networks and Learning Systems, 27(6), 1279–1289. https://doi.org/10.1109/TNNLS.2015.2477537
  • Wei, X., Yu, X., Liu, B., & Zhi, L. (2019). Convolutional neural networks and local binary patterns for hyperspectral image classification. European Journal of Remote Sensing, 52(1), 448–462. https://doi.org/10.1080/22797254.2019.1634980
  • Wu, Z., Liu, J., Yang, J., Xiao, Z., & Xiao, L. (2021). Composite kernel learning network for hyperspectral image classification. International Journal of Remote Sensing, 42(16), 6066–6089. https://doi.org/10.1080/01431161.2021.1934599
  • Xia, J., Chanussot, J., Du, P., & He, X. (2015). Spectral–spatial classification for hyperspectral data using rotation forests with local feature extraction and Markov random fields. IEEE Transactions on Geoscience and Remote Sensing, 53(5), 2532–2546. https://doi.org/10.1109/TGRS.2014.2361618
  • Yang, X., Ye, Y., Li, X., Lau, R. Y., Zhang, X., & Huang, X. (2018). Hyperspectral image classification with deep learning models. IEEE Transactions on Geoscience and Remote Sensing, 56(9), 5408–5423. https://doi.org/10.1109/TGRS.2018.2815613
  • Yu, H., Zhang, H., Liu, Y., Zheng, K., Xu, Z., & Xiao, C. (2022). Dual-channel convolution network with image-based global learning framework for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters, 19, 1–5. https://doi.org/10.1109/LGRS.2021.3139358
  • Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Lin, H., Zhang, Z., … Smola, A. (2020). Resnest: Split-attention networks. arXiv preprint arXiv:2004.08955.
  • Zhang, L., Zhang, L., & Du, B. (2016). Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Magazine, 4(2), 22–40. https://doi.org/10.1109/MGRS.2016.2540798
  • Zhao, C., Wan, X., Zhao, G., Cui, B., Liu, W., & Qi, B. (2017). Spectral-spatial classification of hyperspectral imagery based on stacked sparse autoencoder and random forest. European Journal of Remote Sensing, 50(1), 47–63. https://doi.org/10.1080/22797254.2017.1274566
  • Zhao, J., Zhong, Y., Jia, T., Wang, X., Xu, Y., Shu, H., & Zhang, L. (2018). Spectral-spatial classification of hyperspectral imagery with cooperative game. ISPRS Journal of Photogrammetry and Remote Sensing, 135, 31–42. https://doi.org/10.1016/j.isprsjprs.2017.10.006
  • Zheng, Z., Zhong, Y., Ma, A., & Zhang, L. (2020). FPGA: Fast patch-free global learning framework for fully end-to-end hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 58(8), 5612–5626. https://doi.org/10.1109/TGRS.2020.2967821
  • Zhong, Z., Li, J., Luo, Z., & Chapman, M. (2018). Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework. IEEE Transactions on Geoscience and Remote Sensing, 56(2), 847–858. https://doi.org/10.1109/TGRS.2017.2755542
  • Zhong, Z., Li, Y., Ma, L., Li, J., & Zheng, W. S. (2022). Spectral-spatial transformer network for hyperspectral image classification: A factorized architecture search framework. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–15. https://doi.org/10.1109/TGRS.2022.3217887
  • Zhu, X. X., Tuia, D., Mou, L., Xia, G. S., Zhang, L., Xu, F., & Fraundorfer, F. (2017). Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine, 5(4), 8–36. https://doi.org/10.1109/MGRS.2017.2762307