Canadian Journal of Remote Sensing
Journal canadien de télédétection
Volume 50, 2024 - Issue 1
Technical Note

Spectral Spatial Neighborhood Attention Transformer for Hyperspectral Image Classification

Article: 2347631 | Received 06 Jul 2023, Accepted 16 Apr 2024, Published online: 10 May 2024

Abstract

Hyperspectral images (HSIs) provide rich spectral information, which can support accurate classification in many applications. However, hyperspectral image classification faces challenges including limited labeled data, data redundancy, data sparsity, and imbalanced class samples. Over time, various methods have been proposed to solve these problems. To mitigate them, in this paper we present a neighborhood attention transformer with a channel-wise shift technique for hyperspectral image classification. The neighborhood attention transformer leverages the power of the attention mechanism to capture the spatial relationships between neighboring pixels and extract discriminative features. The channel-wise shift technique empowers the model to adjust the spectral characteristics of each channel adaptively, enhancing its ability to handle the spectral variations present in hyperspectral data. To validate the effectiveness of the proposed model, we conduct comprehensive experiments on publicly available datasets. The results demonstrate that our model consistently outperforms other state-of-the-art methods, with overall accuracies of 99.41%, 99.93%, 98.35%, and 99.59% on the four datasets.


Introduction

Hyperspectral sensors capture hyperspectral image (HSI) data, which consist of hundreds of narrow spectral bands carrying rich information. Hyperspectral image classification is a core earth-observation task, used extensively in agriculture, physics, astronomy, chemical imaging, and environmental sciences (Bioucas-Dias et al., Citation2013). HSI data capture spectral information about surface objects across numerous contiguous spectral bands. Hyperspectral image classification faces three main challenges.

  • First, the spectral dimension of hyperspectral image data contains hundreds of bands with a great deal of information, but the band values are often redundant, which incurs high computational cost.

  • Secondly, pixels in hyperspectral images are mixed, which interferes with classification because a single pixel often covers more than one object category.

  • Third, manually labeling HSI samples is costly and time-consuming, so labeled samples are scarce. Numerous approaches have been proposed in the last decade to address these challenges. This work targets the problem of limited training samples in HSI data.

Machine learning techniques such as the Bayesian approach (Sahin et al., Citation2018), multinomial logistic regression (Haut et al., Citation2017), and support vector machines (SVMs) (Ye et al., Citation2022) have been developed for analyzing hyperspectral image data. These approaches focus primarily on spectral features while disregarding spatial information, which can yield suboptimal outcomes. Researchers have also proposed classification methods that rely on feature extraction or dimensionality reduction (Villa et al., Citation2011) and linear discriminant analysis (Ye et al., Citation2018), as well as techniques (Zhou et al., Citation2015) that extract spectral-spatial features to improve the representation of hyperspectral image data and increase accuracy (Zhang and Lin Citation2010). More recently, deep learning models entered the hyperspectral image field (Zhang et al., Citation2018) and achieved significant results thanks to their remarkable ability to extract spectral-spatial features. Özdemir et al. (Citation2014) first introduced a stacked autoencoder for classifying hyperspectral images. Liu et al. (Citation2017) examined the effectiveness of a deep belief network (DBN)-based model in extracting deep spectral feature maps; the approach involves unsupervised pretraining on unlabeled samples followed by supervised fine-tuning on labeled samples. These models have large numbers of trainable parameters and suffer from spatial information loss, since they require a one-dimensional input format. Convolutional neural networks brought a breakthrough in classification accuracy (Ahmad et al., Citation2022; Gao et al., Citation2020; Guo and Zhu Citation2019; Lee and Kwon Citation2017; Yu et al., Citation2020). Compared with a fully connected network, a convolutional neural network (CNN) effectively exploits local connectivity and shared parameters to extract 2D spatial features. Yang et al. (Citation2016) employed a two-channel CNN to extract spatial and spectral features independently for HSI classification. Since hyperspectral data can be represented as a 3D data cube, using 3D kernels to extract spectral-spatial features is intuitive. While the deep learning models mentioned above have brought significant improvements, achieving higher classification accuracy with fewer labeled samples remains a crucial objective in HSI classification research (Wang et al., Citation2021).

Convolutional neural networks, however, struggle to model long-range dependencies and capture global context information effectively (Tan et al., Citation2021). The transformer model, by contrast, exploits comprehensive contextual information across a wide range by treating the input image as a sequence of patches. Initially introduced for machine translation, transformers have become the state-of-the-art approach in various natural language processing (NLP) tasks (Vaswani et al., Citation2017). In hyperspectral image classification, He et al. (Citation2020) first employed BERT to capture the global dependencies between pixels. Yu et al. (Citation2022) proposed a multilevel spectral-spatial transformer, improving classification accuracy by fusing multilevel features. Sun et al. (Citation2022) proposed a spectral-spatial feature tokenization transformer for hyperspectral image classification, employing 3D and 2D convolutional neural networks to extract low-level features and a Gaussian weighted tokenizer with a multi-head self-attention transformer to extract high-level semantic features. These models rely on the multi-head self-attention mechanism (MHSA), which makes applying such transformers to remote sensing challenging. Ouyang et al. (Citation2023) proposed HybridFormer, which combines a convolutional neural network with a transformer network and replaces multi-head self-attention with a spectral-spatial attention mechanism to extract discriminating local and global features. These transformer models nevertheless require large numbers of training samples to extract discriminating features. Motivated by the transformer attention mechanism, we propose a neighborhood attention mechanism for hyperspectral image classification. The main contributions of this work are as follows:

  • The channel-wise shift technique recalibrates the channel feature information by emphasizing informative spectral bands while simultaneously attenuating less valuable ones.

  • To exploit the capabilities of convolutional neural networks (CNNs), we combine 3D and 2D convolutional layers to capture low-level features.

  • The proposed model uses the neighborhood attention mechanism, rather than self-attention, to obtain long-range dependencies; self-attention is computationally expensive, and the replacement improves the generalization capability of the model.

The remainder of the paper is organized as follows: Section II details the proposed model, Section III covers the datasets and performance evaluation, Section IV presents the results and discussion, and Section V concludes.

Proposed Methodology

In this section, we describe our proposed model, shown in Figure 1. The proposed model consists of three main components: the channel shift mechanism, spectral-spatial feature extraction, and the neighborhood attention transformer.

Figure 1. Proposed neighborhood attention transformer.

Principal Component Analysis (PCA) Process

We adopt a principal component analysis (PCA) step in our proposed work, which extracts the most important features. Hyperspectral image classification typically suffers from a limited number of training samples, which can lead to overfitting when training complex models. To address this, we employ PCA; in addition, it reduces the dimension of the spectral bands while preserving the spatial information as much as possible. PCA measures the importance of each direction by comparing the variance of the data in the projection space: the greater the variance of the data, the more information it contains. In precise terms, PCA preserves the intrinsic information of the data after reducing its dimensionality, evaluating the significance of each projected direction by quantifying data variance. The mathematical process can be summarized as follows: consider a matrix whose column vectors each represent a feature vector. Applying a linear algebraic transformation $X$ to this matrix, the objective can be written as

(1) $\min_{X}\ \operatorname{tr}(X^{T}AX), \quad \text{s.t.}\ X^{T}X=I$

where tr and T denote the trace and transpose of a matrix, and A is the covariance matrix. The output of PCA can then be represented by $Y=X^{T}B$, where the optimal matrix X is composed of the eigenvectors corresponding to the largest n eigenvalues of the covariance matrix as column vectors, thereby reducing the original dimension of B to n.
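As a concrete illustration, the sketch below applies PCA to an HSI cube by flattening it to one row per pixel; the number of retained components (here 30) is an assumed hyperparameter, since the paper does not fix it.

```python
import numpy as np
from sklearn.decomposition import PCA

def apply_pca(cube: np.ndarray, n_components: int = 30) -> np.ndarray:
    """Reduce the spectral dimension of an (h, w, b) HSI cube to n_components."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b)  # one row per pixel, one column per band
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)

# e.g. the Salinas cube: (512, 217, 204) -> (512, 217, 30)
```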

Channel-wise Shift

This work uses the channel-wise shift technique to enhance feature extraction capability and improve classification accuracy. Before the feature information enters the convolution module, the essential features are pre-processed via PCA and the channel shift operation. After PCA, the spectral bands in the HSI data are organized in descending order of importance, and the bands that preserve more spectral information contribute substantially to subsequent feature extraction. The channel-wise shift scheme moves the relatively more critical spectral bands away from the margins, where convolution operations reach them less often; Figure 2 shows the channel-wise shift operation.

Figure 2. Channel wise shift scheme.

This approach increases the number of spatial feature extraction operations applied to informative channels and guarantees their inclusion in the central region of the effective receptive field. As a result, important spectral bands are positioned in the middle of all channels, enabling more convolution operations. Let $x_k$ be the band of channel $k$. Moving a single band $x_k$ to the position after $x_i$ gives $X=(x_1,x_2,\dots,x_i,x_{i+1},\dots,x_{k-1},x_k) \rightarrow X'=(x_1,x_2,\dots,x_i,x_k,x_{i+1},\dots,x_{k-1})$, and moving the block $x_{k-j},\dots,x_k$ gives

(2) $X=(x_1,x_2,\dots,x_i,x_{i+1},\dots,x_{k-j-1},x_{k-j},\dots,x_k) \rightarrow X'=(x_1,x_2,\dots,x_i,x_{k-j},\dots,x_k,x_{i+1},\dots,x_{k-j-1})$

Applied over all $N$ bands (with $B=N$ the number of bands), the linear operation of the channel shift technique follows Equation (3):

(3) $X'(:,:,i)=X(:,:,B-2i-1),\ i\in[0,N/2); \qquad X'(:,:,i)=X(:,:,2i-B),\ i\in[N/2,N)$
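A minimal NumPy sketch of the shift in Equation (3), assuming bands arrive in descending order of importance after PCA and that the band count is even:

```python
import numpy as np

def channel_shift(cube: np.ndarray) -> np.ndarray:
    """Reorder PCA-sorted bands so the most informative ones sit mid-stack."""
    h, w, b = cube.shape
    shifted = np.empty_like(cube)
    for i in range(b // 2):
        shifted[:, :, i] = cube[:, :, b - 2 * i - 1]  # Eq. (3), first case
    for i in range(b // 2, b):
        shifted[:, :, i] = cube[:, :, 2 * i - b]      # Eq. (3), second case
    return shifted

# sanity check with b = 6: input importance order [0,1,2,3,4,5]
# maps to output band order [5, 3, 1, 0, 2, 4] -> band 0 lands at the center
```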

Feature Extraction

Hyperspectral image data are represented as $X\in\mathbb{R}^{h\times w\times b}$, where h is the height of the image, w its width, and b the number of hyperspectral bands. Every pixel within the region of interest is categorized into one of p classes, represented by $L=(l_1,l_2,l_3,\dots,l_p)$. To harness the feature extraction abilities of CNNs, we implement a hierarchical architecture consisting of 3D and 2D CNNs as the backbone network. A 3D adjacent patch $Q\in\mathbb{R}^{s\times s\times b}$ is taken around each pixel, where s is the window size of the small 3D cube.
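For illustration, one simple way to build such patches is to pad the cube and crop an s×s window around every labeled pixel. The helper below is a hypothetical sketch: it assumes label 0 marks unlabeled pixels and uses the 9×9 window adopted later in the paper.

```python
import numpy as np

def extract_patches(cube: np.ndarray, labels: np.ndarray, s: int = 9):
    """Collect s x s neighborhoods around every labeled pixel (label 0 = unlabeled)."""
    pad = s // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    patches, targets = [], []
    for r, c in zip(*np.nonzero(labels)):
        patches.append(padded[r:r + s, c:c + s, :])  # window centered on (r, c)
        targets.append(labels[r, c] - 1)             # shift to 0-based class ids
    return np.stack(patches), np.array(targets)
```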

The 2D convolutional layer focuses on extracting the spatial correlation within each channel of the HSI data, while the 3D convolutional layer also exploits the correlation between different channels to improve feature representation and obtain spectral-spatial feature maps. To be precise, the 2D convolutional layer extracts spatial features but cannot obtain significant features from the spectral bands, whereas the 3D convolutional layer can extract spectral-spatial features; using 2D or 3D convolutional layers alone is therefore not a good option. The filter and kernel parameter sets of the three 3D convolution layers are (8×3×3×7), (16×3×3×5), and (32×3×3×3), where, for example, the first denotes eight filters with a 3D kernel of size (3×3×7). After the 3D convolution operations, we reshape the output into three dimensions by concatenating the third and fourth dimensions. Finally, we use a 2D convolution layer with a (3×3) kernel to reduce the number of spectral bands. The feature maps generated by the 3D layers have different characteristics, so fusing a 2D convolutional layer with the 3D convolutional layers improves accuracy by contributing complementary information; the mixed use of 3D and 2D convolutional layers can fully exploit the spectral and spatial information to obtain more discriminative features. Figure 3 depicts a standard 3D convolution operation: the leftmost and rightmost spectral bands receive fewer 2D convolution operations, since only a single 3D kernel traverses their surfaces, whereas spectral bands positioned toward the center undergo a number of 2D convolutions equal to the size of the third dimension of the 3D kernel.
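The following PyTorch sketch approximates the described backbone with the stated filter sets (8, 16, 32 filters and spectral kernel depths 7, 5, 3). The input band count of 30 and the 64-channel 2D fusion layer are assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class SpectralSpatialExtractor(nn.Module):
    """Three 3D convolutions over (spectral, spatial), then a 2D fusion layer."""

    def __init__(self, bands: int = 30):
        super().__init__()
        # PyTorch orders 3D kernels as (depth, height, width) = (spectral, 3, 3)
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(0, 1, 1)), nn.ReLU(),
            nn.Conv3d(8, 16, kernel_size=(5, 3, 3), padding=(0, 1, 1)), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=(0, 1, 1)), nn.ReLU(),
        )
        depth = bands - 6 - 4 - 2  # spectral size left after the three 3D kernels
        self.conv2d = nn.Sequential(
            nn.Conv2d(32 * depth, 64, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x):      # x: (batch, 1, bands, h, w)
        y = self.conv3d(x)     # (batch, 32, depth, h, w)
        y = y.flatten(1, 2)    # concatenate the channel and spectral dimensions
        return self.conv2d(y)  # (batch, 64, h, w)

# e.g. SpectralSpatialExtractor(30)(torch.randn(2, 1, 30, 9, 9)).shape -> (2, 64, 9, 9)
```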

Figure 3. 3D convolution along the spectral dimension: grey frames represent spectral bands, dark black frames represent the filter positions across the spectral bands, and red frames represent the filters.

Neighborhood Transformer Encoder

Neighborhood attention (Hassani et al., Citation2023) can effectively capture local dependencies and spatial relationships in the data. This mechanism is beneficial for image analysis, where neighboring pixels often contain information relevant to understanding the overall object. It is a localized self-attention that incorporates inductive biases similar to convolutional operations, which are also mimicked in advanced models such as the Swin Transformer (Liu et al., Citation2021). By incorporating these inductive biases, neighborhood attention eliminates additional computational overhead, such as the pixel shift explored in the Swin Transformer. Neighborhood attention constrains the receptive field of the query token to a fixed-size neighborhood of pixels. The aim is to establish a local neighborhood window in which closer neighboring regions receive higher priority under local attention. This often gives improved control over the receptive field, effectively balancing translational invariance and equivariance compared with other vision transformers; Figure 4 shows the neighborhood attention mechanism. For an input image, let the pixel at spatial location (a,b) in the l-th feature map, i.e., (a,b,l), have a neighborhood window denoted τ(a,b,l): a finite set of indices of pixels close to location (a,b) in the l-th feature map.

Figure 4. Neighborhood attention.

For a neighborhood window of size $S\times S$, the neighborhood $\tau(a,b,l)$ contains $|\tau(a,b,l)|=S^2$ pixels. The attention for a single pixel at location (a,b) in the l-th feature map can be expressed in terms of the linear projections q, k, and v of the extracted features Y and the relative positional bias $B_{(a,b)}$. The query has size $(1\times Q)$, while the keys and values have size $(N\times Q)$ for the whole patch matrix:

(4) $q=W_q Y, \quad k=W_k Y, \quad v=W_v Y$

The dot product between queries and keys is computed and a softmax function is applied to obtain the attention map:

(5) $C=\operatorname{softmax}\big((q_{(a,b,l)}k^{T}+B_{(a,b,l)})/\sqrt{d}\big)$

where $d$ is a scaling factor. It is important to note the distinction between self-attention and neighborhood attention: self-attention allows each token to interact with all other tokens, while neighborhood attention restricts the receptive field of each token to its nearest area. The concept of neighborhood attention thus offers a unique advantage by inherently limiting each pixel to its adjacent region without computational overhead, removing the need for pixel shifts to model interdependence across different windows. It is also worth highlighting that if the neighborhood size equals or exceeds the feature map size, the outcome of neighborhood attention closely resembles that of self-attention on the original input feature maps.
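To make the mechanism concrete, the naive sketch below computes single-head neighborhood attention per Equations (4)–(5) for every pixel. It omits the relative positional bias and multi-head splitting, and the border handling (windows clamped to stay inside the map, as in the original neighborhood attention formulation) is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def neighborhood_attention(q, k, v, size=7):
    """Naive single-head neighborhood attention over (H, W, d) feature maps.

    Each query attends only to the size x size window around its own location;
    windows are clamped at the borders so every pixel keeps size**2 neighbors.
    """
    H, W, d = q.shape
    half = size // 2
    out = torch.empty_like(q)
    for a in range(H):
        for b in range(W):
            r0 = min(max(a - half, 0), H - size)  # clamp window inside the map
            c0 = min(max(b - half, 0), W - size)
            kn = k[r0:r0 + size, c0:c0 + size].reshape(-1, d)  # (size*size, d)
            vn = v[r0:r0 + size, c0:c0 + size].reshape(-1, d)
            # Eq. (5) without the relative positional bias term
            attn = F.softmax(q[a, b] @ kn.T / d ** 0.5, dim=-1)
            out[a, b] = attn @ vn
    return out
```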

In summary, the PCA-reduced cube is fed into the channel shift operation, where the high-information bands are placed in the middle of the band stack so that more spectral features can be extracted from them. The features are then passed through a feature extraction module consisting of 3D and 2D convolutional neural networks (CNNs). The neighborhood attention transformer embeds the output of the feature extractor with two successive convolution layers, whose strides yield a spatial size one-fourth that of the hyperspectral input data. The neighborhood attention module employs overlapping convolutions; the model consists of two levels, where the first level has three attention blocks and the second uses four. Each encoder block consists of an attention module and a multilayer perceptron (MLP) with layer normalization and residual connections: layer normalization is applied before each attention module and MLP layer, and residual connections are added after each attention module and MLP layer.
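A minimal sketch of one such encoder block follows, using PyTorch's generic multi-head attention as a stand-in for the neighborhood attention module described above; the head count and MLP expansion ratio are assumptions. The pre-norm ordering matches the description: normalization before each sub-layer, residual addition after.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm block: LayerNorm -> attention -> residual, LayerNorm -> MLP -> residual."""

    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # stand-in for the neighborhood attention module (dim divisible by heads)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))
```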

Dataset Description and Parameter Analysis

Dataset Description

For the experiments, we use four publicly available hyperspectral datasets: the Xuzhou dataset, the Salinas dataset, the GulfPort dataset, and the Botswana dataset. The Xuzhou dataset was acquired by a HYPEX spectral camera over the Xuzhou peri-urban site. Its spatial dimension is 500×260 pixels with a 0.73 m/pixel spatial resolution, and its spectral range is 415 to 2508 nm; 436 bands remain after removing noisy bands. This dataset has 9 labeled category classes; Table 1 shows its details. The second dataset is the Salinas dataset, acquired by the AVIRIS sensor. Its spatial dimension is 512×217 pixels, its wavelength range is 400 nm to 2500 nm, and its spatial and spectral resolutions are 3.7 m/pixel and 10 nm; of its 224 spectral bands, 204 remain after removing noisy bands. This dataset has 16 labeled classes; Table 2 shows its details.

Table 1. Xuzhou dataset labeled samples.

Table 2. Salinas dataset labeled samples.

The third dataset is the Mississippi GulfPort (GP) dataset, acquired over the University of Southern Mississippi Gulf Park campus. It consists of 72 spectral bands, a spatial dimension of 185×89 pixels, and a wavelength range of 375 nm to 1050 nm with a bandwidth of 10 nm. The dataset has 6 labeled land cover classes; its details are shown in Table 3. The fourth dataset is Botswana, acquired by the Hyperion sensor over the Okavango Delta area of Botswana. Its spatial dimension is 1476×256 pixels, its wavelength range is 400 to 2500 nm, its spatial resolution is 30 m, and its spectral resolution is 10 nm across 145 spectral bands. This dataset contains 3248 labeled samples grouped into 14 land cover categories; its details, including training and testing set sizes, are given in Table 4.

Table 3. GulfPort dataset labeled samples.

Table 4. Botswana dataset labeled samples.

Experimental Configuration and Setup

All experiments were performed on an NVIDIA RTX 3060 GPU with 64 GB of RAM in a Python 3.8 environment. The initial learning rate is set to 0.001 with a weight decay of 0.00001 using the Adam optimizer, the batch size is 64, and the number of epochs is set to 100. We randomly chose 200 samples from all classes of the four datasets to train the model, and the remaining samples were used for testing. All experiments were conducted ten times and average results were recorded. The details of the training and test samples are given in Tables 1–4.
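The training loop implied by this configuration might look like the sketch below; the random tensors and the linear placeholder model are stand-ins only, not part of the proposed network.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy stand-in data: 200 training patches, 9 classes (e.g. Xuzhou)
data = TensorDataset(torch.randn(200, 64, 9, 9).flatten(1),
                     torch.randint(0, 9, (200,)))
loader = DataLoader(data, batch_size=64, shuffle=True)

model = torch.nn.Linear(64 * 9 * 9, 9)  # placeholder for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):  # 100 epochs, batch size 64, as in the paper
    for x, y in loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
```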

Selection of PCA

HSI contains much redundant information in its spectral bands, so extracting sufficient spectral information from the image is very challenging. Comparative experiments were conducted on the Botswana dataset with no dimensionality reduction, with principal component analysis (PCA), and with linear discriminant analysis (LDA) pre-processing; the results are shown in Table 5.

Table 5. Dimensionality reduction results on the Botswana dataset.

The comparative experiment results show that principal component analysis (PCA) is very effective and achieves the best classification results; we therefore adopt PCA.

Influence of Patch Size

Input cubes of a specific size centered around each pixel are selected to leverage the spectral-spatial information in HSI. The experimental results are shown in Figure 5.

Figure 5. OA comparison at different patch sizes.

On the GulfPort dataset, OA first increases and then decreases as the patch size grows, because a larger patch brings more information but also introduces noise. Since accuracy drops at larger patch sizes on all datasets, we chose a 9×9 patch size.

Selection of Attention Kernel Size

The attention kernel is the kernel size used to constrain the receptive field of the attention mechanism over the image. Ablation experiments were performed on the Botswana, GP, and Xuzhou datasets with kernel sizes ranging from 3×3 to 9×9; the results are shown in Table 6. Based on these results, we set the kernel size to 7×7.

Table 6. Performance on Botswana, GP and Xuzhou dataset with different kernel sizes.

Result and Analysis

We evaluate the performance of the model using three widely used metrics: overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (K). For comparison, we select several backbone methods, including SSSAN (Zhang et al., Citation2022), HiT (Yang et al., Citation2022), SSTN (Zhong et al., Citation2022), CT Mixer (Zhang et al., Citation2022), HybridFormer (Ouyang et al., Citation2023), TransUnet (Chen et al., Citation2021), and TransFuse (Zhang et al., Citation2021). All comparison models are transformer-based and use different attention modules. The classification results on the four datasets are shown in Tables 7–10.

Table 7. Classification result (%) on Xuzhou dataset.

Table 8. Classification result (%) on Salinas dataset.

Table 9. Classification result (%) on GulfPort dataset.

Table 10. Classification result (%) on Botswana dataset.

Figures 6–9 show the classification maps generated by each model on the Xuzhou, Salinas, GulfPort, and Botswana datasets. Among these models, the proposed model produces the best classification results, with less noise in the classification maps. On the Xuzhou dataset, the proposed model achieves the best results, with overall accuracy (OA), average accuracy (AA), and kappa (K) reaching 99.41%, 99.33%, and 99.25%. The Hyperspectral Image Transformer (HiT) model cannot capture long-range dependency features because of its reliance on 3D convolutional layers. The spectral-spatial self-attention transformer network obtains better results thanks to an attention mechanism that extracts high-level features from small training samples without high computational cost. On the Xuzhou dataset, the proposed model improves OA by 2.97%, 13.9%, 6.64%, 4.91%, 8.08%, 6.39%, and 3.77% compared with the SSSAN, HiT, SSTN, CT Mixer, HybridFormer, TransUnet, and TransFuse models, respectively. Per-class classification accuracy varies because of data imbalance and differing similarity between classes of ground objects. CT Mixer performs well, capturing high-level features by pairing a convolutional neural network with a transformer network and using a local-global multi-head attention mechanism to extract more discriminative features; its OA performance on the Xuzhou dataset is very good.

Figure 6. Classification maps on Xuzhou dataset (a) SSSAN, (b) HiT, (c) SSTN, (d) CT Mixer, (e) HybridFormer, (f) TransUnet, (g) Transfuse, (h) Proposed.

Figure 7. Classification map on Salinas dataset. (a) SSSAN, (b) HiT, (c) SSTN, (d) CT Mixer, (e) HybridFormer, (f) TransUnet, (g) Transfuse, (h) Proposed.

Figure 8. Classification maps on GulfPort dataset. (a) SSSAN, (b) HiT, (c) SSTN, (d) CT Mixer, (e) HybridFormer, (f) TransUnet, (g) Transfuse, (h) Proposed.

Figure 9. Classification maps on Botswana dataset. (a) SSSAN, (b) HiT, (c) SSTN, (d) CT Mixer, (e) HybridFormer, (f) TransUnet, (g) Transfuse, (h) Proposed.

The proposed model achieves the highest accuracy in all classes with small training samples, thanks to its compact attention module and well-chosen kernel size. The classification maps generated on the Xuzhou dataset by the different models using 200 training samples are shown in Figure 6. The maps generated by the proposed model exhibit a lower level of noise than those of the other models; the SSSAN maps and the proposed model's maps look noticeably cleaner than the rest.

The classification results on the Salinas dataset are shown in Table 8. The proposed model achieves the best result, with almost no misclassification; both CT Mixer and the proposed model achieve good results in terms of OA. Other models also perform well in some classes, such as classes two and nine. The proposed model improves average accuracy (AA) by 4.15%, 6.58%, 18.67%, 3.12%, 7.07%, 7.33%, and 6.5% compared with SSSAN, HiT, SSTN, CT Mixer, HybridFormer, TransUnet, and TransFuse. In some classes, SSSAN, HiT, and CT Mixer achieved the best classification results. The classification maps are shown in Figure 7; the noise in the CT Mixer and proposed model maps is nearly negligible.

The classification results on the GulfPort dataset are shown in Table 9. The proposed model improves the Kappa coefficient by 0.69%, 12.86%, 3.83%, 2.55%, 11.78%, 4.78%, and 3.68% compared with the SSSAN, HiT, SSTN, CT Mixer, HybridFormer, TransUnet, and TransFuse models. In some classes, such as classes one, two, and five, the SSSAN and SSTN models achieved the best results, and the average accuracy of SSSAN is better than that of the proposed model. The classification maps are shown in Figure 8; compared with the ground truth, the proposed model shows less noise.

The classification results on the Botswana dataset are shown in Table 10. As can be seen, the proposed model achieved more satisfactory results than the other state-of-the-art models; the SSSAN model and the proposed model achieved the best classification results.

The proposed model improves overall accuracy by 2.02%, 4.01%, 7.98%, 3.24%, and 5.06% compared with other state-of-the-art models. The classification maps obtained from all models are shown in Figure 9. The proposed model's classification map is closest to the ground truth image, with less noise on ground objects.

Visualization

To show the representative ability of the proposed model, we visualized the extracted features in 2D space via t-SNE (van der Maaten and Hinton Citation2008), as shown in Figure 10. Samples from the same class are clustered into one group, while samples from different classes become easily separable from each other. The proposed model learns abstract representations of spectral-spatial features well.
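A typical way to produce such plots, sketched here with scikit-learn's t-SNE (the perplexity value is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str):
    """Project (n_samples, n_features) deep features to 2D and color by class."""
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab20", s=4)
    plt.title(title)
    plt.show()
```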

Figure 10. Visualization of 2D spectral spatial features on different dataset via t-SNE. (a) Xuzhou, (b) Salinas, (c) GP, and (d) Botswana dataset.

Ablation Experiments

Ablation experiments are carried out on the different modules of the proposed model to further verify their effectiveness. Two modules are used in the proposed model: the convolutional neural network (CNN) and the transformer network with neighborhood attention. Table 11 shows the ablation classification results in terms of OA, AA, and the Kappa coefficient. The convolutional neural network alone cannot extract long-range dependency information; adding the neighborhood attention mechanism improves OA, AA, and Kappa by 4.55%, 4.02%, and 4.54%, respectively. The proposed attention module adds very few parameters.

Table 11. Ablation experiments in terms of OA, AA, and Kappa with different modules.

Time Cost Comparison

A comparison of training and testing times for HybridFormer, HiT, SSSAN, CT Mixer, TransUnet, TransFuse, and the proposed model is shown in Table 12. The proposed model is relatively faster than the other models, except SSSAN, CT Mixer, TransUnet, and TransFuse. The proposed model can thus reduce computation time while improving classification accuracy; the time it does take is largely due to the 3D convolutional layers used to extract spectral-spatial features.

Table 12. Training and testing time of different methods on Botswana dataset.

Impact of Training Samples

We further evaluate the performance of the different models with varying numbers of training samples, i.e., 200, 300, 500, and 1000, on the Botswana dataset. Figure 11 shows the OA values of the different models across training sample sizes. Overall, the proposed model consistently performs better than the others at every sample size, while the SSTN and HybridFormer models perform worst. With larger training sets, the proposed model, SSSAN, CT Mixer, and HiT perform comparably. When a small training set is used, however, the robustness (Zhang and Lin Citation2010) of the proposed model can be observed in the larger margin over the other models. This is mainly because we use the neighborhood attention mechanism to extract more discriminative features.

Figure 11. OA achieved by different methods with varying training samples.

Conclusion

In this paper, we propose a neighborhood attention transformer with a channel-wise shift strategy for hyperspectral image classification, to address the problem of limited training samples. The model first uses 3D and 2D convolutional neural networks to extract local spectral-spatial features; more abstract features are then extracted with the proposed neighborhood attention mechanism. Experiments on four HSI datasets validate the proposed model's effectiveness and show its superiority over state-of-the-art models. The model adapts to small-sample image classification better than other methods, yielding more consistent and accurate classification outcomes. Ablation studies on the different modules demonstrate their significant impact on the model's performance. This research thoroughly examines spatial and spatial-spectral information but does not account for the impact of spectral-dimension information; future work will comprehensively incorporate the multi-dimensional data of hyperspectral images into the learning and classification process and integrate an adaptive feature extraction module to better exploit information across dimensions. Furthermore, as lightweight network models are increasingly preferred, future work should focus on developing highly adaptable and widely applicable lightweight models.

Acknowledgments

The authors would like to thank the editors and reviewers for their advice and comments.

Disclosure Statement

The authors reported no conflict of interest.

Data Availability Statement

The data used for this research are available at http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes

Additional information

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62271171.

References

  • Ahmad, M., Khan, A.M., Mazzara, M., Distefano, S., Ali, M., and Sarfraz, M.S. 2022. “A fast and compact 3-D CNN for hyperspectral image classification.” IEEE Geoscience and Remote Sensing Letters, Vol. 19: pp. 1–19. doi:10.1109/LGRS.2020.3043710.
  • Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A. L., and Zhou, Y. 2021. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306.
  • Ouyang, E., Li, B., Hu, W., Zhang, G., Zhao, L., and Wu, J. 2023. “When multigranularity meets spatial–spectral attention: A hybrid transformer for hyperspectral image classification.” IEEE Transactions on Geoscience and Remote Sensing, Vol. 61: pp. 1–18. doi:10.1109/TGRS.2023.3242978.
  • Gao, K., Guo, W., Yu, X., Liu, B., Yu, A., and Wei, X. 2020. “Deep induction network for small samples classification of hyperspectral images.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 13: pp. 3462–3477. doi:10.1109/JSTARS.2020.3002787.
  • Guo, A.J., and Zhu, F. 2019. “A CNN-based spatial feature fusion algorithm for hyperspectral imagery classification.” IEEE Transactions on Geoscience and Remote Sensing, Vol. 57(No. 9):pp. 7170–7181. doi:10.1109/TGRS.2019.2911993.
  • Hassani, A., Walton, S., Li, J., Li, S., and Shi, H. 2023. Neighborhood attention transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6185–6194. arXiv preprint arXiv:2204.07143
  • Haut, J., Paoletti, M., Paz-Gallardo, A., Plaza, J., Plaza, A., and Vigo-Aguiar, J. 2017. “Cloud implementation of logistic regression for hyperspectral image classification.” In Proc. 17th Int. Conf. Comput. Math. Methods Sci. Eng, Vol. 3: pp. 1063–2321. doi:10.1109/JMASS.2020.3019669.
  • He, J., Zhao, L., Yang, H., Zhang, M., and Li, W. 2020. “HSI-BERT: Hyperspectral image classification using the bidirectional encoder representation from transformers.” IEEE Transactions on Geoscience and Remote Sensing, Vol. 58(No. 1): pp. 165–178. doi:10.1109/TGRS.2019.2934760.
  • Bioucas-Dias, J.M., Plaza, A., Camps-Valls, G., Scheunders, P., Nasrabadi, N., and Chanussot, J. 2013. “Hyperspectral remote sensing data analysis and future challenges.” IEEE Geoscience and Remote Sensing Magazine, Vol. 1(No. 2): pp. 6–36. doi:10.1109/MGRS.2013.2244672.
  • van der Maaten, L., and Hinton, G. 2008. “Visualizing data using t-SNE.” The Journal of Machine Learning Research, Vol. 9: pp. 2579–2605.
  • Lee, H., and Kwon, H. 2017. “Going deeper with contextual CNN for hyperspectral image classification.” IEEE Transactions on Image Processing, Vol. 26(No. 10): pp. 4843–4855. doi:10.1109/TIP.2017.2725580.
  • Liu, P., Zhang, H., and Eom, K.B. 2017. “Active deep learning for classification of hyperspectral images.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 10(No. 2): pp. 712–724. doi:10.1109/JSTARS.2016.2598859.
  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., and Guo, B. 2021. “Swin transformer: Hierarchical vision transformer using shifted windows.” In Proceedings of the IEEE/CVF international conference on computer vision, pp.10012–10022.
  • Özdemir, A.O.B., Gedik, B.E., and Çetin, C.Y.Y. 2014. “Hyperspectral classification using stacked autoencoders with deep learning.” 2014 6th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), pp. 1–4. doi:10.1109/WHISPERS.2014.8077532.
  • Sahin, Y.E., Arisoy, S., and Kayabol, K. 2018. “Anomaly detection with Bayesian Gauss background model in hyperspectral images.” 2018 26th Signal Processing and Communications Applications Conference (SIU), pp. 1–4. doi:10.1109/SIU.2018.8404293.
  • Sun, L., Zhao, G., Zheng, Y., and Wu, Z. 2022. “Spectral–spatial feature tokenization transformer for hyperspectral image classification.” IEEE Transactions on Geoscience and Remote Sensing, Vol. 60: pp. 1–14.
  • Tan, X., Gao, K., Liu, B., Fu, Y., and Kang, L. 2021. “Deep global-local transformer network combined with extended morphological profiles for hyperspectral image classification.” Journal of Applied Remote Sensing, Vol. 15(No. 03): pp. 038509. doi:10.1117/1.JRS.15.038509.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., and Polosukhin, I. 2017. “Attention is all you need.” Advances in Neural Information Processing Systems, Vol. 30: pp. 30.
  • Villa, A., Benediktsson, J.A., Chanussot, J., and Jutten, C. 2011. “Hyperspectral image classification with independent component discriminant analysis.” IEEE Transactions on Geoscience and Remote Sensing, Vol. 49(No. 12):pp. 4865–4876. doi:10.1109/TGRS.2011.2153861.
  • Wang, G., Zheng, X., Cheng, L., Wan, X., and Guo, Z. 2021. “Hyperspectral image classification based on improved few shot learning.” In 2021 IEEE International Conference on Electronic Technology, Communication and Information (ICETCI), pp. 673–676. doi:10.1109/ICETCI53161.2021.9563257.
  • Zhang, W.J., Yang, G., Lin, Y., Ji, C., and Gupta, M.M. 2018. “On definition of deep learning.” 2018 World Automation Congress (WAC), Stevenson, WA, USA, pp. 1–5. doi:10.23919/WAC.2018.8430387.
  • Yang, X., Cao, W., Lu, Y., and Zhou, Y. 2022. “Hyperspectral Image Transformer Classification Networks.” IEEE Transactions on Geoscience and Remote Sensing, Vol. 60: pp. 1–15. doi:10.1109/TGRS.2022.3171551.
  • Zhang, X., et al. 2022. “Spectral–spatial self-attention networks for hyperspectral image classification.” IEEE Transactions on Geoscience and Remote Sensing, Vol. 60: pp. 1–15. doi:10.1109/TGRS.2021.3102143.
  • Yang, J., Zhao, Y., Chan, J. C. W., and Yi, C. 2016. “Hyperspectral image classification using two-channel deep convolutional neural network.” In 2016 IEEE international geoscience and remote sensing symposium (IGARSS), pp. 5079–5082. doi:10.1109/IGARSS.2016.7730324.
  • Ye, Q., Huang, P., Zhang, Z., Zheng, Y., Fu, L., and Yang, W. 2022. “Multiview learning with robust double-sided twin SVM.” IEEE Transactions on Cybernetics, Vol. 52(No. 12): pp. 12745–12758. doi:10.1109/TCYB.2021.3088519.
  • Ye, Q., Yang, J., Liu, F., Zhao, C., Ye, N., and Yin, T. 2018. “L1-norm distance linear discriminant analysis based on an effective iterative algorithm.” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 28(No. 1): pp. 114–129. doi:10.1109/TCSVT.2016.2596158.
  • Yu, C., Han, R., Song, M., Liu, C., and Chang, C.I. 2020. “A simplified 2D-3D CNN architecture for hyperspectral image classification based on spatial–spectral fusion.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 13: pp. 2485–2501. doi:10.1109/JSTARS.2020.2983224.
  • Yu, H., Xu, Z., Zheng, K., Hong, D., Yang, H., and Song, M. 2022. “MSTNet: A multilevel spectral–spatial transformer network for hyperspectral image classification.” IEEE Transactions on Geoscience and Remote Sensing, Vol. 60: pp. 1–13. doi:10.1109/TGRS.2022.3186400.
  • Zhong, Y., Bi, T., Wang, J., Zeng, J., Huang, Y., Jiang, T., and Wu, S. 2022. “Spectral–spatial transformer network for hyperspectral image classification: A factorized architecture search framework.” IEEE Transactions on Geoscience and Remote Sensing, Vol. 60: pp. 1–15. doi:10.1109/TGRS.2021.3115699.
  • Zhou, Y., Peng, J., and Chen, C.P. 2015. “Extreme learning machine with composite kernels for hyperspectral image classification.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 8(No. 6): pp. 2351–2360. doi:10.1109/JSTARS.2014.2359965.
  • Zhang, Y., Liu, H., and Hu, Q. 2021. Transfuse: Fusing transformers and cnns for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24 (pp. 14–24). Springer International Publishing.
  • Zhang, W.J., and Lin, Y. 2010. “On the principle of design of resilient systems –application to enterprise information systems.” Enterprise Information Systems, Vol. 4(No. 2):pp. 99–110. doi:10.1080/17517571003763380.
  • Zhang, J., Meng, Z., Zhao, F., Liu, H., and Chang, Z. 2022. “Convolution transformer mixer for hyperspectral image classification.” IEEE Geoscience and Remote Sensing Letters, Vol. 19: pp. 1–5. doi:10.1109/LGRS.2022.3208935.