Research Article

A hybrid convolution transformer for hyperspectral image classification

Article: 2330979 | Received 07 Jul 2023, Accepted 12 Mar 2024, Published online: 23 Mar 2024

ABSTRACT

Hyperspectral images play a crucial role in remote sensing applications such as surveillance, environmental monitoring and precision agriculture, as they contain abundant object information. However, they often face challenges such as limited labelled data and imbalanced classes. In recent years, convolutional neural networks (CNNs) have shown impressive performance in computer vision tasks, including hyperspectral image classification. The emergence of transformers has also attracted attention for hyperspectral image analysis due to their promising capabilities. Nevertheless, transformers typically demand a substantial amount of training data, making their application challenging in scenarios with limited labelled samples. To overcome this limitation, we propose a hybrid convolution transformer framework. Our method combines a vision transformer with a residual 3D convolutional neural network model and uses a sequence aggregation layer to avoid the overfitting that arises when training data are scarce. Our proposed residual channel attention module captures richer spatial-spectral complementary information and preserves spectral details during feature extraction. We conducted experiments on three benchmark datasets. The proposed model achieved state-of-the-art performance of 99.75%, 99.46% and 99.95% in terms of overall accuracy (OA) using only 5%, 5% and 10% labelled training samples, respectively, outperforming other state-of-the-art methods.

Introduction

Hyperspectral image data is acquired by hyperspectral sensors and contains abundant spatial and spectral information. Hyperspectral images are widely used in mineralogy (J. Wang et al., Citation2012), environmental monitoring (Bioucas-Dias et al., Citation2013), surveillance (Uzkent et al., Citation2017), military applications (Ardouin et al., Citation2007), agriculture monitoring (De Petris et al., Citation2023; Pirotti et al., Citation2014) and other domains (Liao et al., Citation2023; Schimleck et al., Citation2022; Vaishnavi et al., Citation2022). Hyperspectral data contain rich spectral information about surface objects. As remote sensing technology advances, the spatial resolution of hyperspectral imaging data has significantly improved, greatly enhancing the ability of hyperspectral image datasets to express different objects. Investigation of the spectral information shows that considerable redundancy exists across bands. To reduce this redundancy, methods such as linear discriminant analysis (Bandos et al., Citation2009), independent component analysis (Villa et al., Citation2011) and principal component analysis (Licciardi et al., Citation2011) have been used. In the early stages, machine-learning techniques were used for hyperspectral image classification, such as support vector machines (Melgani & Bruzzone, Citation2004), random forests (J. M. Haut et al., Citation2017), logistic regression (Ham et al., Citation2005), kernel-based methods (J. Haut et al., Citation2017) and k-means clustering (Camps-Valls & Bruzzone, Citation2005). However, while these traditional techniques can classify hyperspectral images effectively from spectral information alone, their feature learning process does not fully leverage the spatial pixel correlation present in the data. These methods therefore lead to misclassification and cannot give satisfactory classification results. Ma and Chang (Citation2021) proposed methods that utilise active learning and iterative training sampling to enhance spectral-spatial classification.

With the development of deep learning (LeCun et al., Citation2015), approaches in the field of agriculture (Ganatra & Patel, Citation2021; Santos et al., Citation2020) have made progress in the last few years. Deep learning models are able to extract relevant features automatically, in contrast to manually engineered models. Thanks to its local connectivity and weight sharing, the convolutional neural network (CNN) has strong feature extraction abilities. Early on, stacked autoencoders (SAE) (Özdemir et al., Citation2014) and deep belief networks (DBN) (Yao et al., Citation2016) were introduced for satellite image classification and land-use classification tasks. These methods are trained on one-dimensional (1-D) samples; however, flattening the data into 1-D form discards spatial information. As a solution, researchers have explored various network structures based on convolutional neural networks for hyperspectral image classification. An improved classification method was proposed by W. Hu et al. (Citation2015), which uses a 1-D convolutional neural network with five convolution layers, takes spectral information as input and extracts spectral features. Zhao and Du (Citation2016) introduced a 2-D convolutional neural network model to extract spatial features from the first few principal components after dimensionality reduction. Yang et al. (Citation2017) proposed a dual-branch spectral-spatial network using 1-D and 2-D convolutional neural networks, from which joint spectral-spatial features are extracted. For effective spectral-spatial feature extraction, Chen et al. (Citation2016) introduced a 3D convolutional neural network method. Roy et al. (Citation2019) proposed a hybrid network combining 3D and 2D convolutional neural networks to effectively extract spectral-spatial features while reducing computational complexity. Recently, the residual model (Zhong et al., Citation2017) was adopted for hyperspectral image classification to increase the number of layers and extract more discriminative features. In W. Wang et al. (Citation2018), an end-to-end dense convolutional model was proposed to extract spectral-spatial features, leveraging convolution and pooling layers to learn deep features from the data. Apart from convolutional neural networks, other network categories have shown promising performance in hyperspectral image classification. Feedforward and recurrent neural networks (Hang et al., Citation2019) have been used for this task; Wu and Prasad (Citation2017) proposed a combination of a convolutional neural network and a recurrent neural network (RNN), in which the convolutional neural network first extracts features and the recurrent neural network then extracts further contextual information. Meanwhile, alternative network architectures have been explored for hyperspectral image classification, including those based on generative adversarial networks (Zhu et al., Citation2018), capsule networks (Paoletti et al., Citation2018) and graph networks (Hong et al., Citation2020). These networks effectively capture the interdependencies among pixels and demonstrate a high level of robustness. Nevertheless, the convolutional neural network-based methods mentioned above have limited ability to fully capture the correlation between the spectral and spatial domains in hyperspectral image data.

Recently, vision transformers (ViT) (Dosovitskiy et al., Citation2020) have gained great attention in the field of computer vision due to their ability to model long-range dependencies. The attention module describes the global context of the image with the help of positional encoding. Researchers have applied transformers to hyperspectral image data: He et al. (Citation2021) proposed a spatial-spectral transformer model that uses a VGGNet structure to extract spectral-spatial features. Qing et al. (Citation2021) used a spectral attention module combined with a multihead transformer module to effectively capture continuous spectral relationships. Hong et al. (Citation2021) developed a model called SpectralFormer, which learns spectral information from group-wise neighbouring bands and employs a cross-layer transformer encoder. Sun et al. (Citation2022) proposed the spectral-spatial feature tokenization transformer (SSFTT) model to capture semantic features via a tokenization process. Zhang et al. (Citation2022) proposed a convolution transformer mixer to extract local and global information through a combination of a convolutional neural network and a transformer. While the aforementioned methods can capture global information to some extent, they lack the ability to capture local information effectively.

Labelling hyperspectral image data is labour-intensive and time-consuming, leading to a scarcity of labelled samples available for classification tasks. Consequently, the challenge lies in obtaining a better classification result while using as few labelled samples as possible. This topic has been a focal point of research aiming to optimise classification performance under limited-data scenarios. Small-sample hyperspectral classification can be approached through various strategies, including generative adversarial networks (GANs) (Zhong et al., Citation2019), semi-supervised learning, and network optimisation methods (Feng et al., Citation2019). The backbone of network optimisation is the residual connection, which enhances feature fusion and reuse within the model. Indeed, network optimisation is of significant research interest, especially in the context of small-sample data: building a network with a more reasonable structure can be an effective way to address the challenge of limited labelled data.

To address the limitations of previous methods, we propose a hybrid convolutional transformer that uses a convolutional neural network and a vision transformer to extract joint spectral-spatial features, capturing both local and global information at various scales from small training samples. Attention mechanisms are added to both the convolutional neural network and the transformer. By employing this model, the hyperspectral image classification task can effectively leverage both local and global information, leading to improved network representation. The main contributions are as follows:

  • An effective approach is proposed for HSI classification, aiming to leverage both local and global information by utilising a powerful residual 3D convolutional neural network model.

  • To mitigate the loss of spectral features, a residual channel attention mechanism is proposed after the depthwise convolution layer. This mechanism effectively extracts spectral-spatial joint features, ensuring the preservation of valuable information.

  • A sequence aggregation layer is introduced to mitigate the overfitting caused by limited training samples, aggregating information across the sequence while reducing model complexity.

The remainder of the article is organised as follows: Section II describes the proposed methodology; Section III describes the datasets and experimental setup; Section IV presents the results and discussion; Section V concludes the article.

Proposed methodology

Figure 1 shows the proposed model for HSI classification, which is composed of three parts: a residual 3D convolutional neural network for local feature extraction, a residual channel attention module, and a vision transformer module. Let $X_{hsi} \in \mathbb{R}^{H \times W \times C}$ represent the original HSI data, where H, W and C denote the height, width and number of channels, respectively. Suppose that $X_{hsi}$ contains M labelled pixels $X = \{x_1, x_2, \ldots, x_M\} \in \mathbb{R}^{1 \times 1 \times C}$ with corresponding one-hot labels $Y = \{y_1, y_2, \ldots, y_M\} \in \mathbb{R}^{1 \times 1 \times K}$, where K is the number of classes. A spatial patch of size $S \times S$ centred on a pixel defines its spectral-spatial vector. After this data notation, all hyperspectral image (HSI) data are randomly divided into training and testing sets. The training data is used to optimise the proposed model, and the best-trained model is evaluated on the test set. After training and testing, the model is evaluated using three metrics: overall accuracy, average accuracy and the Kappa coefficient. Finally, the model classifies the pixels to generate a classification map. First, we apply the principal component analysis (PCA) algorithm to reduce the noise and redundancy in the input data while preserving the spatial dimensions. Second, we propose a residual 3D convolutional neural network model to jointly learn spectral-spatial features. Third, we propose a residual channel attention module to preserve important spectral information in the bands. Finally, the vision transformer with the sequence aggregation layer is used to extract more representative and discriminative semantic features and perform classification.

Figure 1. Illustration of proposed residual 3D-CNN vision Transformer.


Residual 3D-CNN module

Let the hyperspectral data cube be $X_{hsi} \in \mathbb{R}^{H \times W \times C}$, where H is the height, W the width and C the number of spectral bands. The input data is first pre-processed by principal component analysis (PCA) along the spectral dimension, which reduces the redundancy of the input data. PCA reduces the spectral bands from C to B, and B strongly affects the computational complexity. To use both spectral and spatial information, a small region of size $L \times L$ centred on the pixel location (i, j) is extracted. In the 3D convolution model, the convolution operation applies a 3D convolutional kernel that acts not only on the spatial dimensions but also captures the correlation between multiple spectral bands. Given an input of dimension $L \times L \times B$ and N 3D convolution kernels of size $M \times M \times D$, the output feature map is a 4D tensor of size $(L-M+1) \times (L-M+1) \times (B-D+1) \times N$, where the number of bands along the spectral dimension is fixed. The 3D kernel thus better utilises the spectral features, so the 3D convolutional neural network approach is better suited to hyperspectral image classification, which involves abundant spectral information. During 3D convolution, the activation value at position (a, b, c) on the jth feature map in the ith layer is computed as given in Equation (1):

(1) $x_{ij}^{abc} = \theta\left(b_{ij} + \sum_{l}\sum_{p=1}^{h}\sum_{q=1}^{w}\sum_{r=1}^{d} w_{ijl}^{pqr} \, x_{(i-1)l}^{(a+p)(b+q)(c+r)}\right)$ (1)

where $x_{ij}^{abc}$ is the output value at position (a, b, c), θ is the activation function, $w_{ijl}^{pqr}$ is the kernel weight connected to the lth feature map of the previous layer, and $b_{ij}$ is the bias. In summary, the proposed residual 3D-CNN model uses three 3D convolutional layers with different kernel sizes to extract spectral-spatial discriminative features: 3D_CONV1 = 8×3×3×7, with kernel size 3×3×7 and 8 filters; 3D_CONV2 = 16×3×3×5, with kernel size 3×3×5 and 16 filters; and 3D_CONV3 = 32×3×3×3, with kernel size 3×3×3 and 32 filters. The model weights are initialised randomly and optimised using the Adam optimizer with back-propagation and a softmax loss function. During training, the mini-batch size is set to 64 and training runs for 100 epochs without batch normalisation or augmentation techniques.
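To make the layer configuration concrete, the following minimal TensorFlow/Keras sketch stacks the three convolutional stages with skip connections. The exact block layout of Figure 2 is not fully specified in the text, so the skip-connection placement, the `same` padding (used to keep the element-wise addition shape-compatible) and the helper names `residual_3d_block` and `build_residual_3dcnn` are our assumptions, not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_3d_block(x, filters, kernel_size):
    # Two 3D convolutions with a skip connection; 'same' padding keeps
    # shapes compatible for the element-wise addition (assumption).
    shortcut = layers.Conv3D(filters, 1, padding="same")(x)  # match channel count
    y = layers.Conv3D(filters, kernel_size, padding="same", activation="relu")(x)
    y = layers.Conv3D(filters, kernel_size, padding="same")(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def build_residual_3dcnn(patch=13, bands=30):
    # Input: a patch x patch spatial window with `bands` PCA components.
    inputs = tf.keras.Input(shape=(patch, patch, bands, 1))
    x = residual_3d_block(inputs, 8, (3, 3, 7))   # 3D_CONV1: 8 filters, 3x3x7
    x = residual_3d_block(x, 16, (3, 3, 5))       # 3D_CONV2: 16 filters, 3x3x5
    x = residual_3d_block(x, 32, (3, 3, 3))       # 3D_CONV3: 32 filters, 3x3x3
    return tf.keras.Model(inputs, x)
```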

We compare the proposed 3D-CNN with the conventional 3D-CNN structure. In the traditional approach, spectral information is extracted using 3D convolutional kernel operations, and the extracted features are forwarded directly to the classification network. When such a network becomes too deep, it often suffers from the gradient degradation problem, leading to insufficient extraction of deep spectral features. The residual 3D-CNN effectively addresses this issue, as shown in Figure 2. Additionally, this approach enables the network to be deepened further, ensuring the maximum extraction of semantic information by the model.

Figure 2. Proposed 3D residual CNN Structure.


Attention module

The attention mechanism, which is extensively employed in deep learning tasks such as image recognition, image segmentation and natural language processing (NLP), plays a vital role in selecting pertinent information from a large volume of data. It focuses on highlighting the information most significant for the given task.

The channel attention module is specifically designed to extract spatial-spectral information by integrating depthwise convolution with a channel attention mechanism. The channel attention component allows the network to adaptively prioritise and emphasise certain channels based on their relevance to the task, compensating for the limitation of depthwise convolution, which applies a single filter to each input channel and ignores cross-channel information. The module leverages a residual structure to establish an identity mapping between the original features and the features obtained after convolution, thereby preserving important information from the original features throughout the processing stages.

As shown in Figure 3, the efficient channel attention (ECA) module builds upon the squeeze-and-excitation network (J. Hu et al., Citation2018). The main approach employed by ECA-Net is to leverage local cross-channel interaction without dimensionality reduction. This technique effectively reduces the number of network parameters while still allowing output feature channels to be weighted differently. As a result, ECA-Net successfully extracts crucial features from the image while minimising model complexity. Given an input $X \in \mathbb{R}^{H \times W \times C}$, a global average pooling operation is first applied to reduce the parameter count and incorporate the spatial information of the input. The result is reshaped into a $1 \times 1 \times C$ vector, a one-dimensional convolution is applied to model the interaction between neighbouring channels, and a sigmoid function produces the channel weights. Finally, the output feature map is obtained by multiplying these weights with the input.
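A minimal sketch of such an ECA-style block in TensorFlow/Keras follows; the function name `eca_block` and the 1-D kernel size of 3 are illustrative assumptions rather than the paper's exact settings.

```python
from tensorflow.keras import layers

def eca_block(x, k_size=3):
    # Efficient channel attention: global average pooling, a 1-D convolution
    # across the channel dimension (local cross-channel interaction, no
    # dimensionality reduction), then a sigmoid gate that rescales channels.
    c = x.shape[-1]
    y = layers.GlobalAveragePooling2D()(x)               # (B, C)
    y = layers.Reshape((c, 1))(y)                        # channels as a sequence
    y = layers.Conv1D(1, k_size, padding="same", use_bias=False)(y)
    y = layers.Activation("sigmoid")(y)                  # per-channel weights
    y = layers.Reshape((1, 1, c))(y)                     # broadcastable shape
    return x * y                                         # rescale input channels
```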

Figure 3. Illustration of ECA model.


Vision transformer

The transformer is an innovative deep learning model that leverages the self-attention mechanism and a feed-forward neural network. Unlike the convolutional layers typically employed in hyperspectral image classification, the transformer layer computes feature representations based solely on self-attention, enabling the model to capture rich and robust feature representations. A trainable deep model can be constructed by stacking multiple transformer layers. This model consists of several parts: image tokenization, positional embedding, a classification token, a transformer encoder, and a classification head.

Image tokenization

In a typical transformer model, a sequence of vectors called tokens is provided as input. When applying transformers to images, determining the order of tokens is not straightforward. The vision transformer addresses this challenge by dividing the image into non-overlapping square patches, following a raster-scan order. The sequence of patches $x_p \in \mathbb{R}^{N \times (P^2 C)}$, where N is the number of patches and P the patch size, is formed by flattening each patch into a one-dimensional vector. This simple patching and embedding has limitations, such as loss of information.
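The patch tokenization step can be sketched with TensorFlow's built-in patch op as below; the trailing linear projection to the embedding dimension and the function name `tokenize` are assumptions in line with the standard ViT design.

```python
import tensorflow as tf
from tensorflow.keras import layers

def tokenize(images, patch_size, embed_dim):
    # Split (B, H, W, C) images into non-overlapping P x P patches in raster
    # order, flatten each patch, and project it to the embedding dimension.
    patches = tf.image.extract_patches(
        images,
        sizes=[1, patch_size, patch_size, 1],
        strides=[1, patch_size, patch_size, 1],
        rates=[1, 1, 1, 1],
        padding="VALID",
    )                                            # (B, H/P, W/P, P*P*C)
    n = patches.shape[1] * patches.shape[2]      # number of patches N
    tokens = tf.reshape(patches, (-1, n, patches.shape[-1]))
    return layers.Dense(embed_dim)(tokens)       # linear patch embedding
```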

Positional embedding

Positional embedding is a technique used to incorporate spatial information into the sequence of tokens in a transformer model. Since the model lacks inherent knowledge of the spatial relationship between tokens, providing additional information to represent this relationship is beneficial. This is commonly achieved either with a learned embedding or with fixed sinusoidal encodings built from sine and cosine functions of varying frequencies. By incorporating positional embeddings, the model can learn and recognise the positional relationships between tokens.
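As one of the two options described above, a learned positional embedding can be implemented with a small custom layer; this sketch (the layer name `AddPositionEmbedding` is ours) keeps one trainable vector per token position.

```python
from tensorflow.keras import layers

class AddPositionEmbedding(layers.Layer):
    # Learned positional embedding: one trainable vector per token position,
    # added element-wise to the token embeddings.
    def build(self, input_shape):
        _, n_tokens, dim = input_shape
        self.pos = self.add_weight(
            name="pos_embed", shape=(1, n_tokens, dim),
            initializer="random_normal", trainable=True)

    def call(self, tokens):
        return tokens + self.pos  # broadcasts over the batch dimension
```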

Transformer encoder

The transformer encoder is composed of multiple stacked encoding layers. Each encoder layer consists of two sub-layers: multi-head self-attention and a multilayer perceptron (MLP). Layer normalisation is applied before each sub-layer to normalise its input, and a residual connection is added after each sub-layer, allowing the output to flow to the next one. This residual connection preserves information from the previous sub-layer and aids the flow of gradients during training.
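A single encoder layer of this kind can be sketched in Keras as follows; the head count, MLP width and GELU activation are free parameters assumed from the standard ViT design rather than stated in the text.

```python
from tensorflow.keras import layers

def encoder_layer(x, dim, num_heads, mlp_dim):
    # Pre-norm transformer encoder layer: layer norm -> multi-head
    # self-attention -> residual add, then layer norm -> MLP -> residual add.
    h = layers.LayerNormalization()(x)
    h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=dim // num_heads)(h, h)
    x = layers.Add()([x, h])
    h = layers.LayerNormalization()(x)
    h = layers.Dense(mlp_dim, activation="gelu")(h)
    h = layers.Dense(dim)(h)
    return layers.Add()([x, h])
```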

Classification head

In vision transformers, an additional learnable token, often referred to as the class token, is included in the sequence of embedded patches. This token represents the entire image for classification purposes. After passing through the transformer encoder, the final state of the class token can be used for classification. The class token plays a crucial role in accumulating information about the sequence through self-attention, learning to capture the features and contextual information relevant to the classification task.

Sequence aggregation

In the proposed work, sequence aggregation (Hassani et al., Citation2021) was chosen instead of average or global pooling to flatten the data. Sequence aggregation is an attention-based technique for pooling sequence data. Preserving the full output sequence is motivated by the understanding that it contains relevant information from various parts of the input; by retaining this information, the performance of the model can potentially be enhanced without adding extra parameters. The operation applies a mapping to the output sequence of the transformer encoder, whose shape is (batch size, sequence length, embedding dimension): a linear layer followed by a softmax generates a weight for each input token, the tokens are combined according to these weights, and the flattened result is sent to the classification head. The sequence pool thus enables the network to assign weights to the sequential embeddings generated by the transformer encoder, capturing the correlations within the input data.
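Following this description, a minimal sketch of the attention-based sequence pool (in the spirit of Hassani et al., Citation2021) might look like the following; the helper name `sequence_pool` is ours.

```python
import tensorflow as tf
from tensorflow.keras import layers

def sequence_pool(tokens):
    # tokens: (B, L, D) transformer encoder output. A linear layer scores
    # each token, a softmax over the sequence turns scores into weights,
    # and the weighted sum of tokens replaces average/global pooling.
    scores = layers.Dense(1)(tokens)                        # (B, L, 1)
    weights = tf.nn.softmax(scores, axis=1)                 # one weight per token
    pooled = tf.matmul(weights, tokens, transpose_a=True)   # (B, 1, D)
    return layers.Flatten()(pooled)                         # (B, D)
```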

Dataset description and parameter analysis

Study area

The Xuzhou dataset was acquired with an airborne HYSPEX hyperspectral camera over the Xuzhou peri-urban site in November 2014. It has a spatial size of 500×260 pixels with a high spatial resolution of 0.73 m/pixel. For the experiments, 436 spectral bands covering 415 nm to 2508 nm were used, carefully selected after removing the noisy bands. The scene represents a peri-urban area characterised by nine distinct categories, including crops, vegetation, man-made structures and coal fields. The presence of such diverse and complex mixed categories makes this dataset particularly challenging for classification tasks; given the very high spatial resolution and the diverse nature of the categories, it poses a significant challenge for hyperspectral image classification and analysis. Mineral classification is conducted on the Xuzhou dataset. The numbers of training and testing samples can be found in Table 1.

Table 1. Xuzhou dataset labelled samples.

The second dataset, Salinas, was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Salinas Valley in California, USA. It has a spatial dimension of 512×217 pixels with a spatial resolution of 3.7 m/pixel. The spectral range spans 400 to 2500 nm, comprising 204 spectral bands after removal of 20 noisy bands. The dataset contains 16 manually labelled classes, including various land covers and agricultural crops. Alongside the unlabelled pixels, the dataset is enriched with valuable information that can aid hyperspectral image classification. The numbers of training and testing samples used in these experiments can be found in Table 2. Due to its detailed spectral information and diverse land categories, the Salinas dataset presents a fascinating and demanding scenario for hyperspectral image analysis and classification algorithms.

Table 2. Salinas dataset labelled sample.

The third dataset was acquired by the National Aeronautics and Space Administration (NASA) Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) instrument over the Kennedy Space Center (KSC), Florida, USA, on 23 March 1996. The dataset comprises 224 spectral bands and a spatial size of 512×614 pixels, with a spatial resolution of 1.8 m/pixel. The spectral range spans 400 to 2500 nm with a spectral resolution of 10 nm. The classification task is very challenging due to the presence of 13 distinct land cover types, including water and mixed classes. To optimise the data for classification, water-absorption and low signal-to-noise-ratio (SNR) bands were removed, leaving 176 relevant bands. Detailed information on the numbers of training and testing samples for each class can be found in Table 3. The dataset's diverse spectral and spatial characteristics, together with the abundance of labelled pixels, make it an excellent testbed for evaluating and refining hyperspectral image classification models.

Table 3. KSC dataset labelled samples.

Experimental configuration and setup

We conducted the experiments on an NVIDIA 3060 GPU with 64 GB RAM. The implementation environment is Python 3.8 with the TensorFlow library. We evaluate three performance metrics: OA, AA and the Kappa coefficient. For training and testing of the model, we randomly select 10% of the KSC dataset for training and 90% for testing, and for the Xuzhou and Salinas datasets 5% for training and 95% for testing. These small training fractions are chosen to address the small-sample problem in hyperspectral image data. The batch size, the learning rate of the Adam optimizer, and the number of epochs are set to 64, 0.001 and 100 for all experiments.
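For reference, a minimal sketch of this training configuration in TensorFlow/Keras is given below; the function name `train` and the categorical cross-entropy loss (a softmax loss on one-hot labels) reflect our reading of the setup rather than released code.

```python
import tensorflow as tf

def train(model, x_train, y_train, x_test, y_test):
    # Training configuration from the paper: Adam optimizer with learning
    # rate 0.001, batch size 64, 100 epochs; labels are one-hot encoded.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model.fit(x_train, y_train, batch_size=64, epochs=100,
                     validation_data=(x_test, y_test))
```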

Selection of PCA component

Extracting spectral information from hyperspectral image data is challenging due to the presence of redundant and noisy spectral channels. To address this issue, comparative experiments were conducted using principal component analysis (PCA) on the three datasets, with the number of principal components set to 10, 20, 30, 40 and 50. The results in Table 4 show that 30 principal components give optimal accuracy on all three datasets.
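The band-reduction step can be sketched with scikit-learn's PCA as follows; fitting PCA on the per-pixel spectra and the helper name `reduce_bands` are implementation assumptions.

```python
from sklearn.decomposition import PCA

def reduce_bands(cube, n_components=30):
    # cube: (H, W, C) hyperspectral image. PCA is fit on the per-pixel
    # spectra, and the cube is rebuilt from the leading components.
    h, w, c = cube.shape
    flat = cube.reshape(-1, c)                    # one spectrum per row
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)
```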

Table 4. OA (%) on different PCAs component.

Influence of patch size

In this experiment, different spatial patch sizes are selected around the centre pixel, which can also affect classification accuracy. The results for different spatial sizes are shown in Table 5.

Table 5. OA (%) result on different patch size.

As Table 5 shows, for the KSC dataset the accuracy stops improving as the spatial size increases, and the same scenario holds for the Pavia University dataset. For the Salinas dataset, accuracy fluctuates as the patch size increases; for the optimal result we set the patch size to 13×13.
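A simple sketch of the centre-pixel patch extraction that this experiment varies is shown below; zero-padding at the image borders, so that edge pixels also receive full patches, is our assumption.

```python
import numpy as np

def extract_patch(cube, i, j, size=13):
    # Take a size x size spatial neighbourhood centred on pixel (i, j).
    # The cube is zero-padded so border pixels also get full patches.
    m = size // 2
    padded = np.pad(cube, ((m, m), (m, m), (0, 0)), mode="constant")
    return padded[i:i + size, j:j + size, :]   # (size, size, bands)
```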

Results and discussion

To compare the classification results of the proposed model, comparative experiments are conducted on the three datasets: KSC, Salinas and Xuzhou. The comparison methods are the 2D convolutional neural network (2DCNN) (Zhao & Du, Citation2016), 3D convolutional neural network (3DCNN) (Hamida et al., Citation2018), spectral-spatial residual network (SSRN) (Zhong et al., Citation2017), rethinking-spatial-dimensions vision transformer (RVT) (Heo et al., Citation2021), vision transformer (ViT) (Dosovitskiy et al., Citation2020), convolution transformer mixer (CT Mixer) (Zhang et al., Citation2022), and spectral-spatial feature tokenization transformer (SSFTT) (Sun et al., Citation2022). The classification results compared with these state-of-the-art methods are shown in Tables 6-8. All experimental settings are as discussed in the experimental configuration section.

Table 6. Classification result (%) on Salinas Dataset.

Classification result

Three performance indexes are employed to assess the accuracy of the model in this study: overall accuracy (OA), average accuracy (AA) and the Kappa coefficient (Kappa). The overall accuracy gauges the proportion of correctly classified samples achieved by the model, providing an overall assessment of the model's effectiveness in distinguishing between classes. The average accuracy is the mean of the per-class accuracies across all land objects. The Kappa coefficient is a valuable accuracy metric derived from the confusion matrix; it quantifies the percentage of error reduced by the model compared to a completely random classification.
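Computed from a confusion matrix, the three metrics have compact standard definitions, as the following NumPy sketch (function name ours) shows.

```python
import numpy as np

def classification_metrics(conf):
    # conf: square confusion matrix, rows = true classes, cols = predictions.
    total = conf.sum()
    oa = np.trace(conf) / total                             # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))          # mean per-class accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)                            # Cohen's kappa
    return oa, aa, kappa
```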

By considering these three metrics, the study ensures a thorough evaluation of the model's performance and its ability to accurately distinguish various land objects. To ensure the effectiveness and reliability of the results, we conducted five consecutive experiments. The average performance indices, along with their standard deviations, for each model on the three datasets are presented in Tables 6-8. Figures 4-6 show the classification maps generated by each model on the Salinas, Xuzhou and KSC datasets. Among these models, the proposed model demonstrated superior performance, yielding more detailed and accurate classification results.

Table 7. Classification result (%) on Xuzhou dataset.

Figure 4. Classification maps for Salinas Dataset (a) 2DCNN (b) 3DCNN (c) SSRN (d) RVT (e) ViT (f) CT mixer (g) SSFTT (h) Proposed.


It can be seen that the proposed model achieved the best results, with overall accuracy (OA) reaching 99.75%, 99.46% and 99.95% on the three datasets. The Salinas results show that the 2D CNN is not suitable for small training samples. For the Xuzhou dataset, the spectral-spatial residual network (SSRN) and CT Mixer do not obtain satisfactory results. For the KSC dataset, the 2D and 3D models do not provide better results on small training samples with their 2D and 3D kernels. In terms of classification results, the proposed model, which uses residual connections, is more suitable under the condition of small training samples. To evaluate the classification performance visually, Figures 4-6 show the classification maps obtained from all compared models and the proposed model on the three hyperspectral datasets using 10% and 5% training samples. Notably, the amount of noise in a classification map is inversely related to the classification accuracy. The proposed method exhibits fewer noise points than the other methods, indicating its effectiveness in enhancing the classification performance of HSI. The performance of the proposed model surpasses that of the state-of-the-art methods.

Table 8. Classification result (%) on KSC dataset.

Figure 5. Classification maps for KSC dataset (a) 2DCNN (b) 3DCNN (c) SSRN (d) RVT (e) ViT (f) CT mixer (g) SSFTT (h) Proposed.


Figure 6. Classification maps for Xuzhou dataset (a) 2DCNN (b) 3DCNN (c) SSRN (d) RVT (e) ViT (f) CT mixer (g) SSFTT (h) Proposed.


Discussions

This paper proposed a new hybrid convolution transformer model for hyperspectral image classification, using a 3D residual convolutional neural network with a residual channel attention module and a vision transformer. The model can extract features from small training samples and successfully handle long-range dependencies. As Tables 6-8 show, both convolutional neural network and transformer models, including 2DCNN, 3DCNN, SSRN, ViT, CT Mixer and SSFTT, are considered for comparison, and all experimental results show that the proposed model achieves the best performance on every dataset. For hyperspectral datasets with more redundant spectral bands and lower spatial resolution, spectral features demonstrate superiority over spatial features; for datasets with fewer redundant spectral bands and higher spatial resolution, the opposite holds. On the Salinas dataset, the proposed model improves OA by 1.04%, 0.92%, 0.59%, 0.14%, 0.32%, 2.08% and 0.11% compared to 2DCNN, 3DCNN, SSRN, RVT, ViT, CT Mixer and SSFTT, respectively. It also achieves gains of 0.54%, 0.51%, 0.54%, 0.15%, 1.04% and 0.43% on AA, and 1.08%, 0.95%, 0.58%, 0.09%, 0.29%, 2.24% and 0.06% on Kappa. The classification accuracy differs across classes because of the imbalance of training samples and the similarity of certain classes to other ground objects. On the Xuzhou dataset, the proposed method likewise improves accuracy compared to the state-of-the-art methods, with gains of 0.33%, 0.51%, 0.92%, 1.48%, 1.03%, 0.48% and 0.3% on OA, 0.2%, 0.32%, 0.98%, 1.48%, 1.14%, 1.36% and 0.5% on AA, and 0.41%, 0.64%, 1.17%, 1.33%, 1.3%, 0.6% and 0.37% on Kappa. Table 8 shows the KSC classification results, where the proposed model again improves on the other methods, with gains of 14.28%, 4.45%, 0.47%, 2.92%, 2.3%, 8.51% and 0.02% on OA, 15.66%, 8.31%, 0.69%, 4.19%, 4.41%, 14.55% and 0.07% on AA, and 15.77%, 4.97%, 3.26%, 2.57%, 9.49% and 0.03% on Kappa. All methods obtain satisfactory results with limited training samples, but there is inconsistency in per-class accuracy; taking the Xuzhou dataset as an example, the overall accuracy (OA) of 3DCNN, SSRN and CT Mixer is not consistent. In terms of overall accuracy, the proposed model exhibits superior results. Despite the slight decrease in accuracy for certain classes (<100%), the overall performance of the model remains consistently high. In conclusion, the evaluation of the proposed model on the three datasets reveals some key findings: the proposed model achieves the highest accuracy across all three datasets, with the accuracy for each class surpassing 90%. This demonstrates that the network optimisation strategy not only enhances the accuracy of small-sample classes but also maintains stability for the other classes. The model utilises the 3D residual CNN for feature extraction, incorporates the attention module for spectral enhancement, and employs the vision transformer for extracting high-level features; these architectural components play a crucial role in extracting discriminative features for image classification.

Ablation study

Four sets of ablation experiments were conducted on the KSC dataset, covering the residual 3D-CNN, the attention module and the vision transformer. The overall accuracy results are shown in Table 9. From the table, one can see that the overall accuracy is 97.80% when using the 3D-CNN and attention module, whereas using the 3D-CNN with the vision transformer increases the overall accuracy by 2.11%. The vision transformer can capture subtle spectral features and produces competitive results compared to CNN-based models.

Table 9. OA (%) of ablation Study.

Moreover, applying the attention module increases the overall accuracy (OA) further: without the attention module the OA is 99.91%, and with the attention mechanism it is 99.96%.

Impact of training samples

Evaluating the performance of models under different sample sizes is a crucial aspect of assessing model quality. Deep learning models typically demand a substantial amount of training data to achieve advanced learning capabilities. Unfortunately, obtaining a large number of samples in real-life scenarios can be challenging, and samples are often limited. This scarcity of training data poses a significant challenge for deep learning models and highlights the importance of developing techniques that perform effectively with smaller sample sizes. Figure 7 shows the test results of the various methods on the KSC dataset at different training ratios.

Figure 7. OA of 2–10% of training sample on KSC dataset.

Figure 7. OA of 2–10% of training sample on KSC dataset.

As expected, as the training ratio increases, the accuracy of the different hyperspectral image classification methods also rises. However, regardless of whether the training ratio is high or low, traditional methods such as 2DCNN and 3DCNN consistently yield lower accuracies than more recently proposed methods such as SSRN.

When the training ratio is high (e.g. 10%), there is little difference in the experimental results among the methods. However, RVT and SSRN exhibit a significant drop in performance when the training ratio decreases. On the other hand, when the training ratio is less than 5%, the proposed model demonstrates superior classification performance, indicating that the hybrid convolution transformer can effectively extract discriminative information even with limited training samples. This observation indirectly suggests that the proposed model is capable of handling small-sample scenarios effectively.

Conclusion

In order to effectively leverage both spectral and spatial features from limited training samples, we have introduced a hybrid convolutional transformer network for HSI classification. Our approach combines a residual 3D-CNN model with a channel attention module to extract joint spectral-spatial features, and the integration of a vision transformer further enhances model performance. To address overfitting, we have incorporated a sequence aggregation layer. Experimental results demonstrate the superior performance of our proposed model compared to other state-of-the-art methods. In the future, we aim to refine the model architecture and develop a lightweight network for improved efficiency.

Acknowledgments

The author would like to thank the potential supervisor, editors and reviewers for their advice and comments.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62271171.

References

  • Ardouin, J. P., Lévesque, J., & Rea, T. A. (2007, July). A demonstration of hyperspectral image exploitation for military applications. In 2007 10th International Conference on Information Fusion (pp. 1–15). IEEE. https://doi.org/10.1109/ICIF.2007.4408184
  • Bandos, T. V., Bruzzone, L., & Camps-Valls, G. (2009). Classification of hyperspectral images with regularized linear discriminant analysis. IEEE Transactions on Geoscience and Remote Sensing, 47(3), 862–873. https://doi.org/10.1109/TGRS.2008.2005729
  • Bioucas-Dias, J. M., Plaza, A., Camps-Valls, G., Scheunders, P., Nasrabadi, N., & Chanussot, J. (2013). Hyperspectral remote sensing data analysis and future challenges. IEEE Geoscience and Remote Sensing Magazine, 1(2), 6–36. https://doi.org/10.1109/MGRS.2013.2244672
  • Camps-Valls, G., & Bruzzone, L. (2005). Kernel-based methods for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 43(6), 1351–1362. https://doi.org/10.1109/TGRS.2005.846154
  • Chen, Y., Jiang, H., Li, C., Jia, X., & Ghamisi, P. (2016). Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 54(10), 6232–6251. https://doi.org/10.1109/TGRS.2016.2584107
  • De Petris, S., Sarvia, F., & Borgogno-Mondino, E. (2023). Uncertainty assessment of sentinel-2-retrieved vegetation spectral indices over Europe. European Journal of Remote Sensing. https://doi.org/10.1080/22797254.2023.2267169
  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021, Virtual Event, Austria, arXiv preprint arXiv:2010.11929.
  • Feng, F., Wang, S., Wang, C., & Zhang, J. (2019). Learning deep hierarchical spatial–spectral features for hyperspectral image classification based on residual 3D-2D CNN. Sensors, 19(23), 5276. https://doi.org/10.3390/s19235276
  • Ganatra, N., & Patel, A. (2021). Deep learning methods and applications for precision agriculture. In Machine learning for predictive analysis. Lecture notes in networks and systems, Springer (Vol. 141). https://doi.org/10.1007/978-981-15-7106-0_51
  • Ham, J., Chen, Y., Crawford, M. M., & Ghosh, J. (2005). Investigation of the random forest framework for classification of hyperspectral data. IEEE Transactions on Geoscience and Remote Sensing, 43(3), 492–501. https://doi.org/10.1109/TGRS.2004.842481
  • Hamida, A. B., Benoit, A., Lambert, P., & Amar, C. B. (2018). 3-D deep learning approach for remote sensing image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(8), 4420–4434. https://doi.org/10.1109/TGRS.2018.2818945
  • Hang, R., Liu, Q., Hong, D., & Ghamisi, P. (2019). Cascaded recurrent neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 57(8), 5384–5394. https://doi.org/10.1109/TGRS.2019.2899129
  • Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., & Shi, H. (2021). Escaping the big data paradigm with compact transformers. CoRR, arXiv preprint arXiv:2104.05704.
  • Haut, J., Paoletti, M., Paz-Gallardo, A., Plaza, J., Plaza, A., & Vigo-Aguiar, J. (2017, July). Cloud implementation of logistic regression for hyperspectral image classification. In Proc. 17th int. Conf. Comput. Math. Methods Sci. Eng.(cmmse) (Vol. 3, pp. 1063–2321). Costa Ballena (Rota). https://doi.org/10.1109/JMASS.2020.3019669
  • Haut, J. M., Paoletti, M., Plaza, J., & Plaza, A. (2017). Cloud implementation of the K-means algorithm for hyperspectral image analysis. The Journal of Supercomputing, 73(1), 514–529. https://doi.org/10.1007/s11227-016-1896-3
  • He, X., Chen, Y., & Lin, Z. (2021). Spatial-spectral transformer for hyperspectral image classification. Remote Sensing, 13(3), 498. https://doi.org/10.3390/rs13030498
  • Heo, B., Yun, S., Han, D., Chun, S., Choe, J., & Oh, S. J. (2021). Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada (pp. 11936–11945).
  • Hong, D., Gao, L., Yao, J., Zhang, B., Plaza, A., & Chanussot, J. (2020). Graph convolutional networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 59(7), 5966–5978. https://doi.org/10.1109/TGRS.2020.3015157
  • Hong, D., Han, Z., Yao, J., Gao, L., Zhang, B., Plaza, A., & Chanussot, J. (2021). SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–15. https://doi.org/10.1109/TGRS.2022.3172371
  • Hu, W., Huang, Y., Wei, L., Zhang, F., & Li, H. (2015). Deep convolutional neural networks for hyperspectral image classification. Journal of Sensors, 2015, 1–12. https://doi.org/10.1155/2015/258619
  • Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132–7141). https://doi.org/10.1109/CVPR.2018.00745
  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539
  • Liao, X., Liao, G., & Xiao, L. (2023). Rapeseed storage quality detection using hyperspectral image technology-an application for future smart cities. Journal of Testing and Evaluation, 51(3), 1740–1752. https://doi.org/10.1520/JTE20220073
  • Licciardi, G., Marpu, P. R., Chanussot, J., & Benediktsson, J. A. (2011). Linear versus nonlinear PCA for the classification of hyperspectral data based on the extended morphological profiles. IEEE Geoscience and Remote Sensing Letters, 9(3), 447–451. https://doi.org/10.1109/LGRS.2011.2172185
  • Ma, K. Y., & Chang, C. I. (2021). Iterative training sampling coupled with active learning for semisupervised spectral–spatial hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 59(10), 8672–8692. https://doi.org/10.1109/TGRS.2021.3053204
  • Melgani, F., & Bruzzone, L. (2004). Classification of hyperspectral remote sensing images with support vector machines. IEEE Transactions on Geoscience and Remote Sensing, 42(8), 1778–1790. https://doi.org/10.1109/TGRS.2004.831865
  • Özdemir, A. O. B., Gedik, B. E., & Çetin, C. Y. Y. (2014, June). Hyperspectral classification using stacked autoencoders with deep learning. In 2014 6th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Lausanne, Switzerland (pp. 1–4). IEEE.
  • Paoletti, M. E., Haut, J. M., Fernandez-Beltran, R., Plaza, J., Plaza, A., Li, J., & Pla, F. (2018). Capsule networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 57(4), 2145–2160. https://doi.org/10.1109/TGRS.2018.2871782
  • Pirotti, F., Laurin, G., Vettore, A., Masiero, A., & Valentini, R. (2014). Small footprint full-waveform metrics contribution to the prediction of biomass in tropical forests. Remote Sensing, 6(10), 9576–9599. https://doi.org/10.3390/rs6109576
  • Qing, Y., Liu, W., Feng, L., & Gao, W. (2021). Improved transformer net for hyperspectral image classification. Remote Sensing, 13(11), 2216. https://doi.org/10.3390/rs13112216
  • Roy, S. K., Krishna, G., Dubey, S. R., & Chaudhuri, B. B. (2019). HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters, 17(2), 277–281. https://doi.org/10.1109/LGRS.2019.2918719
  • Santos, L., Santos, F. N., Oliveira, P. M., & Shinde, P. (2020). Deep learning applications in agriculture: A short review. Robot 2019: Fourth Iberian Robotics Conference. Advances in Intelligent Systems and Computing, Springer. vol 1092. https://doi.org/10.1007/978-3-030-35990-4_12
  • Schimleck, L., Ma, T., Inagaki, T., & Tsuchikawa, S. (2022). Review of near infrared hyperspectral imaging applications related to wood and wood products. Applied Spectroscopy Reviews, 58(9), 1–25. https://doi.org/10.1080/05704928.2022.2098759
  • Sun, L., Zhao, G., Zheng, Y., Wu, Z., Ban, Y., Li, X., Zhang, B., & Plaza, A. (2022). Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–14. https://doi.org/10.1109/TGRS.2022.3144158
  • Uzkent, B., Rangnekar, A., & Hoffman, M. (2017). Aerial vehicle tracking by adaptive fusion of hyperspectral likelihood maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 39–48). https://doi.org/10.1109/CVPRW.2017.35
  • Vaishnavi, B. B. S., Pamidighantam, A., Hema, A., & Syam, V. R. (2022, March). Hyperspectral image classification for agricultural applications. In 2022 International Conference on Electronics and Renewable Systems (ICEARS) (pp. 1–7). IEEE. https://doi.org/10.1109/ICEARS53579.2022.9751902
  • Villa, A., Benediktsson, J. A., Chanussot, J., & Jutten, C. (2011). Hyperspectral image classification with independent component discriminant analysis. IEEE Transactions on Geoscience and Remote Sensing, 49(12), 4865–4876. https://doi.org/10.1109/TGRS.2011.2153861
  • Wang, W., Dou, S., Jiang, Z., & Sun, L. (2018). A fast dense spectral–spatial convolution network framework for hyperspectral images classification. Remote Sensing, 10(7), 1068. https://doi.org/10.3390/rs10071068
  • Wang, J., Zhang, L., Tong, Q., & Sun, X. (2012, June). The spectral crust project—research on new mineral exploration technology. In 2012 4th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS) (pp. 1–4). IEEE. https://doi.org/10.1109/WHISPERS.2012.6874254
  • Wu, H., & Prasad, S. (2017). Convolutional recurrent neural networks for hyperspectral data classification. Remote Sensing, 9(3), 298. https://doi.org/10.3390/rs9030298
  • Yang, J., Zhao, Y. Q., & Chan, J. C. W. (2017). Learning and transferring deep joint spectral–spatial features for hyperspectral classification. IEEE Transactions on Geoscience and Remote Sensing, 55(8), 4729–4742. https://doi.org/10.1109/TGRS.2017.2698503
  • Yao, X., Han, J., Cheng, G., Qian, X., & Guo, L. (2016). Semantic annotation of high-resolution satellite images via weakly supervised learning. IEEE Transactions on Geoscience and Remote Sensing, 54(6), 3660–3671. https://doi.org/10.1109/TGRS.2016.2523563
  • Zhang, J., Meng, Z., Zhao, F., Liu, H., & Chang, Z. (2022). Convolution transformer mixer for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters, 19, 1–5. https://doi.org/10.1109/LGRS.2022.3208935
  • Zhao, W., & Du, S. (2016). Spectral–spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach. IEEE Transactions on Geoscience and Remote Sensing, 54(8), 4544–4554. https://doi.org/10.1109/TGRS.2016.2543748
  • Zhong, Z., Li, J., Clausi, D. A., & Wong, A. (2019). Generative adversarial networks and conditional random fields for hyperspectral image classification. IEEE Transactions on Cybernetics, 50(7), 3318–3329. https://doi.org/10.1109/TCYB.2019.2915094
  • Zhong, Z., Li, J., Luo, Z., & Chapman, M. (2017). Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework. IEEE Transactions on Geoscience and Remote Sensing, 56(2), 847–858. https://doi.org/10.1109/TGRS.2017.2755542
  • Zhu, L., Chen, Y., Ghamisi, P., & Benediktsson, J. A. (2018). Generative adversarial networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(9), 5046–5063. https://doi.org/10.1109/TGRS.2018.2805286