Research Article

Objective evaluation-based efficient learning framework for hyperspectral image classification

Article: 2225273 | Received 28 Jan 2023, Accepted 10 Jun 2023, Published online: 17 Jun 2023

ABSTRACT

Deep learning techniques with remarkable performance have been successfully applied to hyperspectral image (HSI) classification. Due to the limited availability of training data, earlier studies primarily adopted the patch-based classification framework, which divides images into overlapping patches for training and testing. However, this framework results in redundant computations and possible information leakage. This study proposes an objective evaluation-based efficient learning framework for HSI classification. It consists of two main parts: (i) a leakage-free balanced sampling strategy and (ii) an efficient fully convolutional network (EfficientFCN) optimized for the accuracy-efficiency trade-off. The leakage-free balanced sampling strategy first generates balanced and non-overlapping training and test data by partitioning the HSI and its ground truth image into non-overlapping windows. Then, the generated training and test data are used to train and test the proposed EfficientFCN. EfficientFCN exhibits a pixel-to-pixel architecture with modifications for faster inference speed and improved parameter efficiency. Experimental results demonstrate that the proposed sampling strategy can provide objective performance evaluation. EfficientFCN outperforms many state-of-the-art approaches concerning the speed-accuracy trade-off. For instance, compared to the recent efficient models EfficientNetV2 and ConvNeXt, EfficientFCN achieves 0.92% and 3.42% superior accuracy and 0.19s and 0.16s faster inference time, respectively, on the Houston dataset. Code is available at https://github.com/xmzhang2018.

1. Introduction

Hyperspectral images (HSIs) contain hundreds of narrow bands spanning from the visible to the infrared spectrum, forming a 3-D hypercube. With abundant spectral information, each material possesses a specific spectral signature, like a unique fingerprint, serving as its identification. Because of their strong representability, HSIs have become economical, rapid, and promising tools for various applications, such as medical imaging (Mok and Chung Citation2020), environmental monitoring (Stuart et al. Citation2019), and urban development observation (Alamús et al. Citation2017). Semantic segmentation (also called pixel-level classification) is one of the most fundamental tasks for these applications.

Many HSI classification methods have been developed over the past few decades. Earlier approaches mainly focused on spectral information mining using machine learning methods, including unsupervised algorithms (e.g. clustering (Haut et al., Citation2017)) and supervised algorithms (e.g. support vector machines (Cortes and Vapnik Citation1995) and random forest (Breiman Citation2001)). Unsupervised algorithms do not rely on labeled data; however, supervised algorithms are generally preferred because of their superior performance. Nevertheless, the inherent high dimensionality and nonlinearity of HSIs limit the performance of supervised algorithms, especially when labeled samples are limited. Several dimensionality reduction techniques, such as band selection (Paul et al. Citation2015), feature selection (Quan et al. Citation2023), and manifold learning (Huang et al. Citation2015), have been introduced to project hypercube data into lower-dimensional subspaces by capturing essential information in HSIs. Given the spectral heterogeneity and complex spatial distribution of objects, spatial feature mining has attracted considerable attention (Gao and Lim Citation2019). Spatial feature extraction methods, such as gray-level co-occurrence matrix (Pesaresi, Gerhardinger, and Kayitakire Citation2008), guided filtering (Wang et al. Citation2018), and morphological operators (Bao et al. Citation2016), have been employed to extract spatial features of HSIs. Other studies adopted kernel-based methods (Lin and Yan Citation2016), 3-D wavelets (Cao et al. Citation2017; Tang, Lu, and Yuan Citation2015), and 3-D Gabor filters (Jia et al. Citation2018) to learn the joint spectral – spatial information for better classification. Although these traditional methods have achieved considerable progress, they are limited to shallow features and prior knowledge, resulting in poor robustness and generalization.

Deep learning (DL) can automatically learn high-level representations, overcoming the limitations of traditional feature extraction methods. It has achieved high performance in many challenging tasks, including object detection (Hou et al. Citation2019), scene segmentation (Fu et al. Citation2019), and image classification (Tan and Le Citation2021). Subsequently, various DL techniques have been adopted for HSI classification. A multilayer perceptron was designed as an encoder – decoder structure to extract the deep semantic information of HSIs (Lin et al. Citation2022). Chen (Chen, Zhao, and Jia Citation2015) introduced a deep belief network to HSI classification and designed three architectures based on this network for spectral, spatial, and spectral – spatial feature extraction. In (Hao et al. Citation2018), a stacked autoencoder and a convolutional neural network (CNN) were employed to encode spectral and spatial features, respectively, which were then fused for classification. Recurrent neural networks (RNNs) (Mou, Ghamisi, and Zhu Citation2017) and long short-term memory (LSTM) (Xu et al. Citation2018) have been applied to analyze hyperspectral pixels as sequential data. Moreover, graph convolutional networks have been employed to model long-range spatial relationships of HSIs because they can handle graph-structured data by modeling topological relationships between samples (Jiang, Ma, and Liu Citation2022). In (He, Chen, and Lin Citation2021; Hong et al. Citation2022; Sun et al. Citation2022), transformers were introduced to capture long-range sequence spectra in HSI. Among these DL algorithms, CNNs generally outperform the others in HSI classification because of their ability and flexibility to aggregate spectral and spatial contextual information (Sothe et al. Citation2020). The properties of local connections and shared weights allow CNNs to achieve higher accuracy with fewer parameters.

Many CNN-based methods have been proposed for HSI classification, including patch-based classification and fully convolutional network (FCN)-based segmentation. Previous studies (Paoletti et al. Citation2018; Zhang et al. Citation2021) mainly focused on patch-based classification, which assigns the category of a pixel by extracting features from the spatial patch centered on this pixel. However, redundant computation is inevitable with this method because overlap occurs between adjacent patches, as shown in Figure 1(a). Many FCN-based approaches (Wang et al. Citation2021; Xu, Du, and Zhang Citation2020; Zheng et al. Citation2020) have been proposed to reduce computational complexity. They feed the initial HSI cube into the network, perform pixel-to-pixel classification, and output the entire classification map. Compared to patch-based classification, FCN-based segmentation usually produces competitive or superior results with less inference time.

Figure 1. Demonstration of the traditional sampling strategy, which results in (a) overlap between adjacent patches, (b) overlap between the training and test data, and (c) blurred boundaries of the classification map. In (a), dots represent the central pixels of the corresponding patches with white borders. In (b), green and red dots represent the training and test pixels of the corresponding patches, respectively.


However, unlike computer vision datasets containing thousands of labeled images, HSI datasets often include only one partially labeled image. Almost all of the aforementioned methods employ the random sampling strategy, where the training and test samples are randomly selected from the same image; as a result, the feature extraction spaces of the training and test data overlap, as shown in Figure 1(b). Consequently, in the training stage, information from the test data is used to train the network, leading to exaggerated results (Liang et al. Citation2017). Similarly, the existing FCN-based approaches that take the same entire HSI as input for training and testing also cause training – test information leakage. Therefore, their performance and generalizability results are questionable because they violate the fundamental independence assumption of supervised learning (Liang et al. Citation2017). Although several new sampling strategies (Liang et al. Citation2017; Zou et al. Citation2020) have been proposed to avoid training – test information leakage, other limitations may emerge, e.g. imbalanced sampling may leave certain categories entirely in the training set or entirely in the test set. In addition, the existing FCN-based segmentation networks that take an entire HSI as input incur significant memory consumption and limited batch sizes, dramatically slowing down training.

To address these limitations, we propose an objective evaluation-based efficient learning (OEEL) framework for HSI classification and objective performance evaluation. First, to ensure balanced sampling and no training-test information leakage, a leakage-free balanced sampling strategy is proposed to generate training and test samples. Then, the EfficientFCN is designed to learn discriminative spectral – spatial features from the generated samples for effective and efficient data classification. Therefore, the proposed framework not only ensures that the feature extraction spaces of the training and test data are independent of each other, but also improves the classification accuracy and efficiency.

The main contributions of this study are summarized as follows:

  1. The OEEL framework is proposed for HSI classification to achieve fast classification and objective evaluation. This framework includes a sampling strategy and an FCN-based architecture.

  2. Since the existing sampling strategies are unsuitable for limited-label HSI datasets, four principles that should be considered for effective sampling strategy design are proposed. As per these principles, we design a leakage-free balanced sampling strategy for HSI datasets to generate balanced training and test data without information leakage, ensuring objective and accurate evaluation.

  3. An EfficientFCN architecture with high inference and parameter efficiency is developed to promote real-world applications. Experiments on four publicly available datasets demonstrate that the proposed architecture is faster and more accurate than previous methods.

The remainder of this study is organized as follows. In Section 2, we review patch-based classification, FCN-based segmentation, and the sampling strategy for HSI datasets. Then, a detailed elaboration of the proposed OEEL framework is provided in Section 3, and the experiments and results are presented in Section 4. A series of ablation experiments and comparative experiments are conducted and discussed in Section 5. Finally, some concluding remarks are given in Section 6.

2. Related work

2.1 Patch-based classification

Most previous studies (Paoletti et al. Citation2018; Zhang et al. Citation2021) employed the patch-based classification framework to facilitate feature extraction and classifier training. An end-to-end network takes 3-D patches as input and outputs a specific label for each patch in its last fully connected (FC) layer (Paoletti et al. Citation2018). Another end-to-end 2-D CNN (Yu, Jia, and Xu Citation2017) uses 1 × 1 convolutional kernels to mine spectral information and uses global average pooling to replace FC layers to prevent overfitting. Santara (Santara et al. Citation2017) proposed a band-adaptive spectral – spatial feature learning neural network to address the curse of dimensionality and spatial variability of spectral signatures. It divides 3-D patches into sub-cubes along the channel dimension to extract band-specific spectral – spatial features. To enhance the learning efficiency and prevent overfitting, a deeper and wider network with residual learning was proposed (Lee and Kwon Citation2017), which employs a multi-scale filter bank to jointly exploit spectral – spatial information.

Two-branch CNN-based architectures (Hao et al. Citation2018; Liang et al. Citation2017; Xu et al. Citation2018) employ 2-D CNNs and other algorithms (e.g. 1-D CNN, stacked autoencoder, and LSTM) to encode spatial and spectral information, respectively, and then fuse the outputs for classification. Another type of spectral – spatial-based CNN architecture employs 3-D CNN to extract joint spectral – spatial features for HSI classification (Li, Zhang, and Shen Citation2017; Paoletti et al. Citation2018). For instance, the spectral – spatial residual network (SSRN) (Zhong et al. Citation2018) uses spectral and spatial residual blocks consecutively to learn spectral and spatial information from raw 3-D patches. A fast, dense spectral – spatial convolution framework (Wang et al. Citation2018) uses residual blocks with 1 × 1 × d (d > 1) and a × a × 1 (a > 1) convolution kernels to learn spectral and spatial information sequentially.

Recently, attention mechanisms have been introduced to adaptively emphasize informative features (Zhang et al. Citation2021). The squeeze-and-excitation (SE) module (Hu, Shen, and Sun Citation2018), which uses global pooling and FC layers to generate channel attention vectors, was adopted in (Fang et al. Citation2019; Huang et al. Citation2020) to recalibrate spectral feature responses. The convolutional block attention module (Woo et al. Citation2018) was adopted in (Zhu et al. Citation2020), where the spatial branch appends a spatial-wise attention module while the spectral branch appends a channel-wise attention module to extract spectral and spatial features in parallel. Similarly, the position self-attention module and the channel self-attention module proposed in (Fu et al. Citation2019) were introduced into a double-branch dual-attention mechanism network (DBDA) (Li et al. Citation2020) to refine the extracted features of HSIs. In (Zhang et al. Citation2021), a spatial self-attention module was designed for patch-based CNNs to enhance the spatial feature representation related to the center pixel.

Although the above patch-based classification methods achieved high performance, it is unclear whether this is attributed to the improved performance of the methods or the training – test information leakage (Liang et al. Citation2017). Furthermore, redundant computation of overlapping regions of adjacent patches is inevitable in these methods.

2.2 FCN-based segmentation

Many FCN-based frameworks have been developed to mitigate redundant computation caused by overlap between adjacent patches. The spectral – spatial fully convolutional network (SSFCN) (Xu, Du, and Zhang Citation2020) takes the original HSI cube as input and performs classification in an end-to-end, pixel-to-pixel manner. A deep FCN with an efficient nonlocal module (Shen et al. Citation2021) was proposed that takes an entire HSI as input and uses an efficient nonlocal module to capture long-range contextual information. To exploit global spatial information, Zheng et al. (Zheng et al. Citation2020) proposed a fast patch-free global learning framework that includes a global stochastic stratified sampling strategy and an encoder – decoder-based FCN (FreeNet). However, this framework does not perform well with imbalanced sample data. A spectral – spatial dependent global learning (SSDGL) framework (Zhu et al. Citation2021) was developed to handle imbalanced and insufficient HSI data.

Although these FCN-based frameworks alleviate redundant computation and achieve significant performance gains, they may lead to higher training – test information leakage. This is because they use the same image for both training and testing, thus leading to overlap and interaction between the feature extraction spaces of the training and test data.

2.3 Sampling strategy

The aforementioned training – test information leakage not only leads to a biased evaluation of spatial classification methods but may also distort the boundaries of objects, as shown in Figure 1(c). Therefore, the pixel-based random sampling strategy inadvertently affects both feature learning and performance evaluation.

Several new sampling strategies have been proposed to address these limitations. A controlled random sampling strategy was designed to reduce overlap between training and test samples (Liang et al. Citation2017). Specifically, this strategy randomly selects a labeled pixel from each unconnected partition as a seed and then extends the region from the seed pixel to generate training data. Finally, pixels in the grown regions are selected as training data, and the remaining pixels are selected as test data. This sampling strategy dramatically reduces the overlap between training and test data, but it cannot eliminate it because pixels at the boundaries of each training region still overlap with test data. Nalepa et al. (Nalepa, Myller, and Kawulok Citation2019) proposed to divide the HSI into fixed-size patches without overlapping and then randomly select some patches as the training set. The method proposed in (Zou et al. Citation2020) only selects training samples from multi-class blocks following a specific order. Nevertheless, both methods may suffer from a severe sample imbalance, i.e. there may be certain categories for which all data are selected as the test or training set. The former causes the trained model to fail to recognize these categories, while the latter results in a lack of test samples for evaluation. Furthermore, these methods disregard boundary pixels, where a patch cannot be defined. Therefore, the significant loss of samples together with the scarcity of training samples can cause overfitting.

3. Method

This section presents the OEEL framework. As shown in Figure 2, it comprises two main steps. First, the proposed leakage-free balanced sampling strategy divides the HSI cube into non-overlapping training and test data. Second, the generated training and test data are used to train and test the proposed EfficientFCN for feature extraction and data classification. The relevant details of both steps are described below.

Figure 2. Overview of the proposed OEEL framework. The framework includes two core components: a leakage-free balanced sampling strategy and an EfficientFCN.


3.1 Leakage-free balanced sampling strategy

As discussed in Section 2.3, the commonly used sampling strategy exaggerates the classification results because of training – test information leakage. Although several new sampling strategies have been proposed to address this problem, other limitations may emerge. Based on these observations and empirical studies (Liang et al. Citation2017; Zou et al. Citation2020), we derived four basic principles for effective sampling strategy design: P1) balanced sampling to ensure that all categories are present in both training and test sets; P2) samples should be maximally utilized; P3) regions that contribute to feature extraction from training data cannot be used for testing to satisfy the independence assumption; and P4) random sampling to avoid biased estimates.

As per these principles, we designed a leakage-free balanced sampling strategy, as shown in Figure 3. Since many spatial-based methods require square patches as input, the HSI and its ground truth need to be divided into square windows of equal size. To satisfy P1, the window size should ensure that each class appears in at least two windows; there is a trade-off between the window size and the number of windows. If the width and height of the image are not divisible by the window size, the pixels on the right and bottom borders are mirrored outward, as shown in the first step of Figure 3. This step allows all the border pixels to be fed into the network and used like any other pixels in the image. Once the border pixels are mirrored, the HSI and its ground truth are split into disjoint windows.
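The padding-and-splitting step above can be sketched in a few lines of NumPy (a sketch under our own naming; `pad_and_split`, the 32 × 32 window size, and the use of the IP image size are illustrative, not the authors' implementation):

```python
import numpy as np

def pad_and_split(gt, win):
    """Mirror-pad the right/bottom borders so height and width are
    divisible by `win`, then split into non-overlapping windows."""
    h, w = gt.shape[:2]
    pad_h = (-h) % win
    pad_w = (-w) % win
    pad_width = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (gt.ndim - 2)
    padded = np.pad(gt, pad_width, mode="symmetric")  # mirror border pixels outward
    H, W = padded.shape[0], padded.shape[1]
    return [padded[i:i + win, j:j + win]
            for i in range(0, H, win)
            for j in range(0, W, win)]

# e.g. a 145 x 145 ground truth (IP dataset) split into 32 x 32 windows
gt = np.zeros((145, 145), dtype=np.int64)
wins = pad_and_split(gt, 32)
```

With `mode="symmetric"`, `np.pad` mirrors the border pixels outward, so a 145 × 145 image becomes 160 × 160 and yields a 5 × 5 grid of disjoint 32 × 32 windows.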

Figure 3. Flowchart of the proposed leakage-free balanced sampling strategy. The operation process of an HSI is the same as that of the ground truth. For convenience, only the ground truth operation process is presented.


The next step is to divide these windows into training and test windows according to a predefined order to satisfy P1 and P3–P4. The predefined order can be either by category or by the number of samples within each category. Here, we perform window-based random sampling within each category in order. As shown in the dotted box of Figure 3, windows containing the first class are collected; then, a predefined proportion of these windows is randomly selected for training, while the remaining windows are used for testing (P4). To satisfy P3, the corresponding positions of the windows containing the first class are set to zero in the HSI and its ground truth, which are then used to collect the windows containing the next category. This process is repeated until sampling is complete for all categories. Note that each window is selected only once to avoid repeated sampling; every window serves exclusively as a training or a test window, so the training and test sets remain independent of each other.
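The category-ordered window assignment described above can be sketched as follows (assuming a list of ground-truth windows; the function name `leakage_free_split`, the 50% training ratio, and the toy data are illustrative):

```python
import numpy as np

def leakage_free_split(gt_windows, n_classes, train_ratio=0.5, seed=0):
    """Assign whole windows to training or testing, category by category.
    Each window is assigned at most once (the 'set to zero' step in the
    text), so training and test regions never overlap."""
    rng = np.random.default_rng(seed)
    assigned = set()
    train_ids, test_ids = [], []
    for c in range(1, n_classes + 1):                  # label 0 = unlabeled
        candidates = [i for i, w in enumerate(gt_windows)
                      if i not in assigned and (w == c).any()]
        rng.shuffle(candidates)                        # random sampling (P4)
        k = int(round(train_ratio * len(candidates)))
        train_ids += candidates[:k]
        test_ids += candidates[k:]
        assigned.update(candidates)                    # never resampled
    return train_ids, test_ids

# Toy ground truth: four windows of class 1 and four of class 2
wins = [np.full((4, 4), 1)] * 4 + [np.full((4, 4), 2)] * 4
train_ids, test_ids = leakage_free_split(wins, n_classes=2)
```

Because every window lands in exactly one of the two sets, the feature extraction spaces of training and test data stay disjoint by construction.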

Data augmentation is necessary to avoid overfitting given the limited number of training windows. As in most previous studies (Xu et al. Citation2018; Zhang et al. Citation2021), each training window is randomly rotated between 0° and 360° and horizontally or vertically flipped. We additionally add noise or change the brightness of training windows to enhance robustness under varying conditions such as different sensors, illumination changes, and atmospheric interference.
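A minimal sketch of such augmentation, restricting rotations to 90° multiples so square windows stay square (an assumption on our part), with illustrative noise and brightness parameters the paper does not specify:

```python
import numpy as np

def augment(window, rng):
    """Random rotation, flips, additive noise, and brightness jitter.
    The noise level (0.01) and brightness range (+/-10%) are illustrative;
    the paper does not specify exact values."""
    window = np.rot90(window, k=int(rng.integers(4)), axes=(0, 1))
    if rng.random() < 0.5:
        window = window[::-1, :]              # vertical flip
    if rng.random() < 0.5:
        window = window[:, ::-1]              # horizontal flip
    window = window + rng.normal(0.0, 0.01, size=window.shape)  # sensor noise
    window = window * rng.uniform(0.9, 1.1)                     # brightness change
    return np.ascontiguousarray(window)

rng = np.random.default_rng(0)
aug = augment(np.ones((8, 8, 4), dtype=np.float32), rng)  # (h, w, bands) window
```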

A summary of the proposed sampling strategy is provided in Algorithm 1. It follows all of the abovementioned principles, enabling the accurate and objective performance evaluation of approaches.

3.2 EfficientFCN

Prior works mainly sought to make very deep models converge with reasonable accuracy or to design complicated models that achieve better performance. Consequently, the resultant models were neither simple nor practical, limiting real-world applications. Therefore, this subsection proposes an EfficientFCN, which is optimized for faster inference and higher parameter efficiency. It includes two main blocks, the efficient feature extraction (EFE) block and the fused efficient feature extraction (fused EFE) block, which are described as follows.

3.2.1 EFE block

Because the depthwise convolution (Chollet Citation2017) has fewer parameters and floating-point operations (FLOPs) than regular convolutions, it was introduced into MBConv (Tan and Le Citation2021) to achieve higher parameter efficiency. MBConv is defined by a 1 × 1 expansion convolution followed by 3 × 3 depthwise convolutions, an SE module, and a 1 × 1 projection layer. Its input and output are connected by a residual connection when they have the same number of channels. MBConv attaches batch normalization (BN) and a sigmoid linear unit (SiLU) activation function to each convolutional layer.

To improve network efficiency, we first replace SiLU with the scaled exponential linear unit (SELU). SELU exhibits self-normalizing properties, which are faster than external normalization and help the network converge faster. The SELU activation function is defined as:

(1) SELU(x) = λx, if x > 0; SELU(x) = λα(e^x − 1), if x ≤ 0,

where x is the input, α and λ (λ > 1) are hyperparameters, and e is the base of the natural exponential. SELU reduces the variance for negative inputs and increases it for positive inputs, thereby preventing vanishing and exploding gradients. Moreover, it produces outputs with zero mean and unit variance. Therefore, SELU converges faster and more accurately than SiLU, leading to better generalization (Madasu and Rao Vijjini Citation2019).
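The self-normalizing behaviour can be checked numerically. The α and λ values below are the standard SELU constants from Klambauer et al. (2017); the paper does not state them explicitly:

```python
import numpy as np

# Standard SELU constants (Klambauer et al., 2017); note lambda > 1 as in Eq. (1)
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # SELU(x) = lambda * x                  for x > 0
    #         = lambda * alpha * (e^x - 1)  for x <= 0
    return np.where(x > 0, LAMBDA * x, LAMBDA * ALPHA * (np.exp(x) - 1.0))

# Self-normalizing check: for standard-normal inputs, the output stays
# close to zero mean and unit variance, so no external normalization is needed.
x = np.random.default_rng(0).standard_normal(1_000_000)
y = selu(x)
```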

Layer normalization (LN) has been used in ConvNeXt (Liu et al. Citation2022) and slightly outperformed BN in various application scenarios. Following the same optimization strategy as (Liu et al. Citation2022), we substitute BN with LN in our network.

Considering that LN and activation function operations take considerable time (Ma et al. Citation2018), ConvNeXt uses fewer LN and activation functions and achieves better results. Therefore, we also use fewer LN and SELU activation functions to improve accuracy and efficiency. As shown in Figure 4(a), LN and the activation function are attached only after the expansion convolution and the depthwise convolution, respectively. Furthermore, the SE module is removed because of the high computational cost of its FC layers. The results in Section 5.2 demonstrate that these modifications improve not only training speed and parameter efficiency but also classification performance.

Figure 4. EfficientFCN architecture designed for HSI classification. (a) EFE block. (b) Fused EFE block. (c) EfficientFCN embedded with EFE blocks and fused EFE blocks, where the number of output channels and repetitions per block is listed on the left and right sides, respectively.


Figure 4(a) shows the detailed architecture of the EFE block. It comprises a 1 × 1 expansion convolution with LN, followed by a 3 × 3 depthwise convolution with the SELU activation function and a 1 × 1 projection layer. The expansion ratio of the first 1 × 1 convolution is set to 2. As in MBConv, the input and output of the EFE block are connected via a residual connection when they have the same number of channels.
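A PyTorch sketch of the EFE block under the description above (the ConvNeXt-style channel LayerNorm and all class names are our assumptions; the authors' released code may differ):

```python
import torch
import torch.nn as nn

class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel dimension of an NCHW tensor
    (ConvNeXt-style; our assumption about how LN is applied here)."""
    def __init__(self, channels):
        super().__init__()
        self.ln = nn.LayerNorm(channels)

    def forward(self, x):
        x = x.permute(0, 2, 3, 1)              # NCHW -> NHWC
        return self.ln(x).permute(0, 3, 1, 2)  # back to NCHW

class EFEBlock(nn.Module):
    """1x1 expansion (ratio 2) + LN, 3x3 depthwise conv + SELU,
    1x1 projection; residual connection when channel counts match."""
    def __init__(self, in_ch, out_ch, expand=2):
        super().__init__()
        mid = in_ch * expand
        self.expand = nn.Conv2d(in_ch, mid, 1)
        self.norm = ChannelLayerNorm(mid)
        self.dw = nn.Conv2d(mid, mid, 3, padding=1, groups=mid)  # depthwise
        self.act = nn.SELU()
        self.project = nn.Conv2d(mid, out_ch, 1)
        self.use_residual = in_ch == out_ch

    def forward(self, x):
        y = self.project(self.act(self.dw(self.norm(self.expand(x)))))
        return x + y if self.use_residual else y

out = EFEBlock(32, 32)(torch.randn(2, 32, 16, 16))  # spatial size preserved
```

Setting `groups=mid` in the 3 × 3 convolution makes it depthwise, which is where the parameter and FLOP savings over a regular convolution come from.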

3.2.2 Fused EFE block

Since depthwise convolutions cannot fully utilize modern accelerators, Fused-MBConv replaces the 3 × 3 depthwise convolutions and 1 × 1 expansion convolution in MBConv with a single regular 3 × 3 convolution (Tan and Le Citation2021). We follow Fused-MBConv and replace the 1 × 1 expansion convolution and 3 × 3 depthwise convolutions in the EFE block with a single regular 3 × 3 convolution to improve the training speed, as shown in Figure 4(b). Similarly, LN and SELU are appended only after the 3 × 3 convolution and the 1 × 1 convolution, respectively. As in the EFE block, the expansion ratio is set to 2.
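A PyTorch sketch of the fused EFE block under the same assumptions (the class name and the exact LN placement are our reading of the text, not the released code):

```python
import torch
import torch.nn as nn

class FusedEFEBlock(nn.Module):
    """The 1x1 expansion and 3x3 depthwise convolutions of the EFE block
    fused into one regular 3x3 convolution (expansion ratio 2), followed
    by a 1x1 projection; LN after the 3x3 conv, SELU after the 1x1 conv."""
    def __init__(self, in_ch, out_ch, expand=2):
        super().__init__()
        mid = in_ch * expand
        self.conv3 = nn.Conv2d(in_ch, mid, 3, padding=1)
        self.ln = nn.LayerNorm(mid)            # applied over channels (NHWC)
        self.conv1 = nn.Conv2d(mid, out_ch, 1)
        self.act = nn.SELU()
        self.use_residual = in_ch == out_ch

    def forward(self, x):
        y = self.conv3(x)
        y = self.ln(y.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        y = self.act(self.conv1(y))
        return x + y if self.use_residual else y

out = FusedEFEBlock(16, 16)(torch.randn(1, 16, 8, 8))
```

The single dense 3 × 3 convolution trades extra FLOPs for better accelerator utilization, which is why it is used in the early stages (Section 3.2.3).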

3.2.3 EfficientFCN

It has been demonstrated that depthwise convolutions are slow in the early stages but effective in deep layers (Tan and Le Citation2021). Thus, the EFE block is placed in the deep layers. After incorporating the EFE and fused EFE blocks, the EfficientFCN architecture can be developed, as shown in Figure 4(c), where the number of output channels and repetitions of each block is listed on the left and right sides, respectively. The network aims to learn a mapping X ∈ R^(h×w×B) → Y ∈ R^(h×w×K) for classification, where h × w and B are the spatial size and the number of bands of X, respectively, and K is the number of categories to be classified.

In our network, the number of channels starts at the maximum value and decreases as the layer deepens. We refer to this operation as inverted channels. HSIs with abundant spectral information inevitably contain a high degree of redundancy between bands. Inverted channels can allow the network to learn additional valuable information from redundant bands.

There are no pooling layers throughout the network, for two main reasons. First, pooling aggregates features and discards their positions, making the network more invariant to spatial transformations; this spatial invariance, in turn, limits the accuracy of semantic segmentation. Second, pooling is primarily used to reduce computational complexity by shrinking the spatial dimensions of feature maps, which causes a significant loss of spatial information and may blur land cover boundaries, especially when the input size is small. Moreover, our task is pixel-wise classification: the network output must have the same spatial dimensions as the input. Therefore, we do not perform any downsampling operations. Note that our EfficientFCN still maintains the capability to process images with arbitrary spatial sizes. We nevertheless extract patches and feed them to the network to generate the final full classification map, for two main reasons: 1) it ensures that the feature extraction spaces of the training and test data are independent of each other, and 2) smaller input sizes require fewer computations and allow larger batch sizes, thus improving training speed.
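The size-preservation argument can be illustrated with a toy stack of 'same'-padded convolutions. The band count B = 103 and class count K = 9 below mirror the PU dataset; the layers themselves are illustrative, not the EfficientFCN:

```python
import torch
import torch.nn as nn

# With 'same'-padded convolutions and no pooling or striding, the output
# keeps the input's spatial size, so a single forward pass yields one
# class-logit vector per pixel for any window size.
net = nn.Sequential(
    nn.Conv2d(103, 64, 3, padding=1), nn.SELU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.SELU(),
    nn.Conv2d(64, 9, 1),                      # per-pixel class logits
)
for h, w in [(32, 32), (17, 23)]:             # arbitrary spatial sizes
    logits = net(torch.randn(1, 103, h, w))
    assert logits.shape == (1, 9, h, w)       # spatial dims preserved
```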

After the EfficientFCN is constructed, its parameters are initialized and trained end to end. The performance of the proposed FCN is presented in Section 4.

4. Experiments

This section describes the experimental datasets and settings, including comparison methods, evaluation metrics, and parameter settings. Quantitative and qualitative analyses of the experimental results are also presented.

4.1 Description of datasets

We conducted experiments on four datasets of different sizes: Indian Pines (IP), Pavia University (PU), Salinas (SA), and University of Houston (UH).

The IP dataset was collected in 1992 by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over northwestern Indiana, USA, an agricultural area with irregular forest regions and crops of regular geometry. The dataset has 145 × 145 pixels with a spatial resolution of 20 m. Each pixel has 224 spectral bands ranging from 0.4 to 2.5 µm. After discarding 24 noise and water absorption bands, 200 bands were used for classification. The ground truth has 16 land cover classes. Figure 5(a) summarizes the class names and numbers of samples. The spatial distribution of the training data, produced using the proposed sampling strategy, is shown in Figure 5(b).

Figure 5. IP dataset. (a) Land cover type and sample settings. (b) Spatial distribution of training samples (white windows).


The PU dataset, covering the University of Pavia, Northern Italy, was collected by the Reflective Optics System Imaging Spectrometer sensor in 2001. The dataset is a 610 × 340 × 115 data cube with a spatial resolution of 1.3 m and a wavelength range of 0.43–0.86 µm. Before the experiments, the number of spectral bands was reduced to 103 by removing water absorption bands. The scene is an urban environment characterized by natural objects and shadows, where nine classes of land cover are labeled. Detailed information about this dataset is provided in Figure 6.

Figure 6. PU dataset. (a) Land cover type and sample settings. (b) Spatial distribution of training samples (white windows).


The SA dataset was recorded by the AVIRIS sensor over several agricultural fields in Salinas Valley, California, USA. It contains 512 × 217 pixels with a spatial resolution of 3.7 m per pixel. Each pixel has 224 spectral bands in the spectral range of 0.36–2.5 µm. As with the IP dataset, 20 noise and water absorption bands were discarded before the experiments. As summarized in Figure 7(a), 16 land-cover classes were defined, and Figure 7(b) shows the spatial distribution of the training data.

Figure 7. SA dataset. (a) Land cover type and sample settings. (b) Spatial distribution of training samples (white windows).


The UH dataset covers an urban area that includes the University of Houston campus and neighboring areas. It was collected by the National Center for Airborne Laser Mapping in June 2012. It has 144 spectral bands in the wavelength range of 0.38–1.05 µm. The spatial dimension and resolution of this scene are 349 × 1905 and 2.5 m, respectively. There are 15 classes in this scene, and detailed information about this dataset is presented in Figure 8(a).

Figure 8. UH dataset. (a) Land cover type and sample settings. (b) Spatial distribution of training samples (white windows).


Before the experiments, we normalized each dataset to [−1, 1] to unify the data magnitude and promote network convergence.
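The normalization step can be sketched as follows; per-band min-max scaling is an assumption for illustration, since the text states only that the data were scaled to [−1, 1]:

```python
import numpy as np

def normalize_bands(hsi):
    """Scale each spectral band of an (H, W, B) cube to [-1, 1].

    Per-band min-max scaling is an assumption; the paper states only
    that data were normalized to [-1, 1] before training.
    """
    mins = hsi.min(axis=(0, 1), keepdims=True)
    maxs = hsi.max(axis=(0, 1), keepdims=True)
    return 2.0 * (hsi - mins) / (maxs - mins) - 1.0

# Example on an IP-sized cube (145 x 145 pixels, 200 retained bands)
cube = np.random.default_rng(0).random((145, 145, 200)).astype(np.float32)
norm = normalize_bands(cube)
```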

4.2 Experimental settings

We compared the performance of the proposed network with that of state-of-the-art DL architectures, including SSRN (Zhong et al. Citation2018), DBDA (Li et al. Citation2020), the spectral–spatial 3-D fully convolutional network (SS3FCN) (Zou et al. Citation2020), FreeNet (Zheng et al. Citation2020), SSDGL (Zhu et al. Citation2021), ConvNeXt (Liu et al. Citation2022), and EfficientNetV2 (Tan and Le Citation2021). Both SSRN and DBDA are patch-based 3-D CNN networks. SSRN uses consecutive spectral and spatial residual blocks to learn spectral and spatial representations, respectively, followed by an average pooling layer and an FC layer. DBDA includes a dense spectral branch with a channel attention module and a dense spatial branch with a position attention module; the outputs of both branches are concatenated and fed to an average pooling layer, followed by an FC layer for classification. SS3FCN takes small patches of the original HSI as input and performs pixel-to-pixel classification, where parallel 3-D and 1-D FCNs learn joint spectral–spatial features and spectral features, respectively. FreeNet and SSDGL are both encoder–decoder-based FCN architectures designed to exploit global discriminative information: FreeNet uses a spectral attention-based encoder, while SSDGL uses a global convolutional LSTM and a joint attention mechanism to capture long-range dependencies. EfficientNetV2 and ConvNeXt are state-of-the-art backbones in computer vision; EfficientNetV2 offers high parameter efficiency and fast training, while ConvNeXt performs comparably to transformers.

There are many parameters related to DL architectures. In EfficientFCN, the convolutional stride and spatial padding size are set to 1, and the dropout rate is set to 0.2. Other hyperparameters are presented in . These hyperparameters can be adjusted for different situations; for example, the number of output channels can be halved for the PU dataset, which has fewer channels. For a fair comparison, the above hyperparameter setting was used for all four datasets in the following experiments. The proposed network adopted the AdamW optimizer (Loshchilov and Hutter Citation2019), with the learning rate, weight decay, and number of training epochs set to 1 × 10−4, 1 × 10−2, and 150, respectively. The hyperparameters of the comparison methods were initialized to their recommended values and then fine-tuned to achieve the best performance. For EfficientNetV2 and ConvNeXt, we adopted their smallest model settings (i.e. EfficientNetV2-S and ConvNeXt-T) and reduced their numbers of stages and layers in equal proportion to match those of our EfficientFCN. All methods were implemented on the PyTorch platform and were trained and tested on the same sample sets generated by the proposed sampling strategy. The batch size was set to 64 for all methods. All experiments were conducted on a workstation with an AMD Ryzen 7 5800X 8-core 3.40 GHz CPU and an NVIDIA GeForce RTX 3060 GPU.
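The training configuration above can be sketched in PyTorch as follows; the tiny two-layer model is a placeholder, not the paper's EfficientFCN:

```python
import torch
from torch import nn

# Placeholder model standing in for EfficientFCN: stride and spatial padding
# of 1 and a dropout rate of 0.2, as stated in the text. 200 input channels
# match the IP dataset's retained bands; the layer sizes are illustrative.
model = nn.Sequential(
    nn.Conv2d(200, 64, kernel_size=3, stride=1, padding=1),
    nn.GELU(),
    nn.Dropout(p=0.2),
    nn.Conv2d(64, 16, kernel_size=1),
)

# AdamW with the reported learning rate, weight decay, and epoch budget
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
num_epochs = 150
```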

Classification performance was evaluated by producer accuracy (PA) of each class, overall accuracy (OA), average accuracy (AA), and kappa coefficient (Kappa). All experiments were repeated 10 times to avoid biased estimation, and mean values were calculated for comparison, as presented in Section 4.3.
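These metrics can all be derived from a confusion matrix; the sketch below follows the standard definitions (PA per class, OA, AA as the mean of the per-class PAs, and Cohen's kappa):

```python
import numpy as np

def oa_aa_kappa(y_true, y_pred, n_classes):
    """Compute OA, AA, and the kappa coefficient from label vectors.

    AA is the mean of the per-class producer accuracies (PA); kappa
    follows Cohen's definition based on chance agreement.
    """
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                           # rows: reference, cols: prediction
    total = cm.sum()
    oa = np.trace(cm) / total                   # overall accuracy
    pa = np.diag(cm) / cm.sum(axis=1)           # producer accuracy per class
    aa = pa.mean()                              # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```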

4.3 Experimental results

4.3.1 Quantitative evaluation

Tables 1–4 summarize the classification accuracy of all compared methods. From these tables, we can observe that the performance of all methods was considerably lower on the IP dataset than on the other datasets, especially for the 4th, 9th, and 15th categories. This may be due to the lack of training data and the low spatial resolution of this dataset. Nevertheless, on all four datasets, our network achieved the highest OA, AA, and Kappa and exhibited the best or near-best accuracy in most classes. For example, on the IP dataset, the proposed method obtained the highest OA of 84.72%, exceeding SSRN, DBDA, SS3FCN, FreeNet, SSDGL, ConvNeXt, and EfficientNetV2 by ~7.04%, 6.35%, 8.05%, 2.96%, 2.65%, 5.54%, and 3.64%, respectively.

Table 1. Comparison of classification accuracy of different methods on the IP dataset.

Table 2. Comparison of classification accuracy of different methods on the PU dataset.

Table 3. Comparison of classification accuracy of different methods on the SA dataset.

Table 4. Comparison of classification accuracy of different methods on the UH dataset.

Although some comparison methods achieved satisfactory results in previous studies, they failed to perform well on certain datasets under the proposed sampling strategy. Among these methods, SS3FCN generally exhibited the worst performance, since it uses a 3-D FCN and a 1-D FCN to learn spectral–spatial features and spectral features, respectively, resulting in high spectral redundancy and increased model complexity. Regarding FCN-based methods, FreeNet and SSDGL performed better on the IP dataset but worse on the other datasets. A possible reason is that the scarcity of labeled data makes these networks, which are more complex than the others, difficult to optimize. Compared with FreeNet and SSDGL, the patch-based methods (i.e. SSRN and DBDA) performed worse on the IP dataset but better on the other three datasets. ConvNeXt and EfficientNetV2, however, performed well on all four datasets, indicating superior generalization performance. Note that the proposed network exhibited significant improvement over all of the above comparison methods on all four datasets, demonstrating its effectiveness and generalizability. The proposed network classified the corresponding test data with relatively high accuracy, even for certain indistinguishable classes (e.g. Gravel in the PU dataset and railways in the SA dataset). These results confirm the robustness of the designed network under challenging conditions.

4.3.2 Qualitative evaluation

Figures 9–12 visualize the corresponding classification maps alongside the false color images and ground truth maps. As can be seen, the classification maps are consistent with the reported quantitative results. For example, the classification maps produced by SS3FCN contained more noise and speckles than those produced by other methods on the IP, PU, and SA datasets, consistent with the quantitative results in Tables 1–3. Among these methods, the proposed network produced the least noise and the most accurate classification maps on all datasets. In addition, objects covered by shadows could be identified using the proposed framework. As illustrated by the black rectangles in Figure 12, parts of buildings, roads, and vegetation were covered in shadows. SS3FCN, EfficientNetV2, ConvNeXt, and the proposed network could detect shadow regions more effectively than SSRN, DBDA, FreeNet, and SSDGL.

Figure 9. Classification maps of different methods on the IP dataset. (a) False color image. (b) Ground truth image. (c) SSRN. (d) DBDA. (e) SS3FCN. (f) FreeNet. (g) SSDGL. (h) EfficientNetV2. (i) ConvNeXt. (j) EfficientFCN.


Figure 10. Classification maps of different methods on the PU dataset. (a) False color image. (b) Ground truth image. (c) SSRN. (d) DBDA. (e) SS3FCN. (f) FreeNet. (g) SSDGL. (h) EfficientNetV2. (i) ConvNeXt. (j) EfficientFCN.


Figure 11. Classification maps of different methods on the SA dataset. (a) False color image. (b) Ground truth image. (c) SSRN. (d) DBDA. (e) SS3FCN. (f) FreeNet. (g) SSDGL. (h) EfficientNetV2. (i) ConvNeXt. (j) EfficientFCN.


Figure 12. Classification maps of different methods on the UH dataset. (a) False color image. (b) Ground truth image. (c) SSRN. (d) DBDA. (e) SS3FCN. (f) FreeNet. (g) SSDGL. (h) EfficientNetV2. (i) ConvNeXt. (j) EfficientFCN.


Furthermore, with the proposed sampling strategy, the class boundaries of the classification maps produced by the spectral–spatial methods are more consistent with those of the false color images, especially for the IP dataset. However, there are many square-like artifacts in the classification maps. This phenomenon has two main causes: 1) the input window size was too small to provide sufficient spatial information, resulting in inconsistent segmentation across window boundaries, and 2) stitching the per-window predictions introduces discontinuities. Therefore, selecting a larger window size is preferable, provided the basic principles of designing an effective sampling strategy are met, as described in Section 3.1. Furthermore, the overlay inference strategy (Zheng et al. Citation2021) can alleviate this problem.
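The overlay idea can be sketched as follows: windows are predicted with a stride smaller than the window size, and the per-pixel class scores of overlapping windows are averaged before the argmax, smoothing discontinuities at window borders. The `predict` callable and the stride handling here are illustrative, not the exact procedure of Zheng et al. (Citation2021):

```python
import numpy as np

def overlay_inference(image, predict, win, stride):
    """Average class scores over overlapping windows before the argmax.

    `predict` maps a (win, win, B) window to (win, win, n_classes) scores.
    Illustrative sketch of the overlay strategy, not the exact published
    procedure.
    """
    h, w, _ = image.shape
    n_classes = predict(image[:win, :win]).shape[-1]
    scores = np.zeros((h, w, n_classes))
    counts = np.zeros((h, w, 1))
    for i in range(0, h - win + 1, stride):
        for j in range(0, w - win + 1, stride):
            scores[i:i + win, j:j + win] += predict(image[i:i + win, j:j + win])
            counts[i:i + win, j:j + win] += 1
    return (scores / np.maximum(counts, 1)).argmax(axis=-1)
```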

In summary, the experimental results demonstrate the superiority of the proposed network and indicate that the performance of spectral–spatial methods can be more accurately reflected and evaluated using the proposed sampling strategy.

5. Discussion

5.1 Leakage-free balanced sampling strategy analysis

As can be seen from the training-sample distributions in Figures 5–8, there is no overlap between the training and test data, and all classes are present in both sets, demonstrating that the proposed sampling strategy can avoid information leakage and achieve balanced sampling.

In addition, we observed a trade-off between the window size and the number of windows: overly small windows provide limited spatial information for spatial-based methods to learn from, whereas excessively large windows cause certain classes with limited samples to appear only in the training or test set. Therefore, we analyzed the effect of window size on the performance of the proposed EfficientFCN. Due to the limited number of labeled samples for specific classes in the IP dataset (e.g. the Oats category has only 20 labeled pixels), we set its window size to the minimum value of 4. For the other datasets, we conducted experiments to select the optimal window size, varying the window size while fixing all other parameters.
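The core leakage-free idea, partitioning the scene into non-overlapping windows and assigning each window wholesale to either the training or the test set so that no pixel (or its neighbor within a window) appears in both, can be sketched as follows; the class balancing that the proposed strategy additionally enforces is omitted for brevity, and all names are illustrative:

```python
import random

def split_windows(height, width, win, train_frac=0.2, seed=0):
    """Partition an image grid into non-overlapping win x win windows and
    assign each window wholesale to the training or test set, so no pixel
    is shared between the two sets. Class balancing is omitted here.
    """
    cells = [(i, j) for i in range(0, height - win + 1, win)
                    for j in range(0, width - win + 1, win)]
    random.Random(seed).shuffle(cells)          # deterministic shuffle
    n_train = int(len(cells) * train_frac)
    return cells[:n_train], cells[n_train:]

# Example: a 12 x 12 scene split into 4 x 4 windows, 25% for training
train_wins, test_wins = split_windows(12, 12, 4, train_frac=0.25)
```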

Unlike patch-based classification, where accuracy improves as the patch size increases, in our experiments accuracy did not increase with increasing window size, as illustrated in Figure 13; in some cases, accuracy even decreased as the window size grew. Moreover, the difference in accuracy across window sizes was minor, again demonstrating that the proposed sampling strategy can eliminate the spatial dependence between training and test data.

Figure 13. Variation of test accuracy with input window size on the IP dataset.


Although the smallest window size achieved the highest accuracy on certain datasets, it failed to provide sufficient spatial information for methods with strong spatial information extraction ability. Moreover, smaller window sizes resulted in lower inference efficiency and more scattered noise points. Therefore, it is preferable to choose a larger window size with comparable accuracy. Weighing efficiency against accuracy, we set the window size to 6 for the PU dataset and 9 for the SA and UH datasets.

Our sampling strategy applies not only to HSI data, but also to other real-world remote sensing data, especially data with imbalanced categories. However, it is unsuitable for large datasets containing hundreds or thousands of labeled images, such as computer vision datasets.

5.2 EfficientFCN analysis

We then analyzed the proposed network design by following a trajectory from EfficientNetV2 to EfficientFCN. Experiments were conducted on the IP and UH datasets, both of which are typical and challenging. The corresponding results are summarized in Table 5.

Table 5. Ablation analysis of the proposed EfficientFCN on the IP and UH datasets.

The normalization layer and activation function are important components of a network. We first evaluated the performance of the proposed network with different activation functions: SiLU, SELU, and GELU. SiLU and GELU are used in EfficientNetV2 (Tan and Le Citation2021) and ConvNeXt (Liu et al. Citation2022), respectively, while SELU possesses self-normalizing properties that make neural network learning highly robust. As shown in Table 5, the network trained with GELU achieved the best results, with an OA of 81.39% on the IP dataset and 89.56% on the UH dataset. Therefore, the proposed network adopted GELU as the activation function in the following experiments.

For normalization, as illustrated in Table 5, our network trained with LN obtained a slightly higher OA than with BN, with gains of 3% and 1% on the IP and UH datasets, respectively. Therefore, we used LN for normalization in the proposed network.

The numbers of activation and normalization layers also affect network performance. As shown in Table 5, after reducing the number of LN and SELU activation layers, the classification accuracy on both datasets did not decline but slightly improved. This may be because SELU induces self-normalizing properties, making additional normalization unnecessary.

To avoid overfitting and reduce the number of trainable parameters, we used the channel attention module (Fu et al. Citation2019), which has no trainable parameters, to replace the SE module; this did not contribute to accuracy. We then removed the attention module from the network altogether. Interestingly, this led to marginal improvements on both datasets (from 82.47% to 82.72% on the IP dataset and from 89.81% to 89.84% on the UH dataset). Thus, the final network contains no attention module.
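A parameter-free channel attention in the spirit of Fu et al. (Citation2019) derives channel affinities from the feature map itself, so no weights are learned; the NumPy sketch below is illustrative rather than the paper's exact module:

```python
import numpy as np

def channel_attention(x):
    """Parameter-free channel attention: affinities come from a channel
    Gram matrix followed by a softmax, so no weights are learned.

    x has shape (C, H, W). The residual connection and other details are
    illustrative assumptions.
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w)                            # (C, N)
    energy = flat @ flat.T                                # channel affinities (C, C)
    energy = energy - energy.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(energy)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = (attn @ flat).reshape(c, h, w)
    return out + x                                        # residual connection
```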

As detailed in Table 5, the inverted channels significantly increased the OA from 82.72% to 84.72% on the IP dataset and from 89.84% to 91.43% on the UH dataset. This demonstrates that the inverted channels setting helps the network excavate additional discriminative spectral information. Due to this setting, our EfficientFCN only applies to HSI data and is unsuitable for data with fewer bands, such as multispectral data.

In addition, for further comparison, we replaced the proposed EFE and Fused EFE blocks in EfficientFCN with normal convolutions. The corresponding results are summarized in Table 6. After separately replacing the EFE and Fused EFE blocks with normal convolutions, the classification accuracies on the IP and UH datasets all decreased to varying degrees. This further demonstrates that the EFE and Fused EFE blocks consistently improve performance by enhancing the discriminative feature learning ability of the network.

Table 6. Effects of the EFE and Fused EFE blocks on the performance of the proposed EfficientFCN.

The modified network improved classification results for both datasets. The superior performance of our network is attributed to its better ability to capture valuable information from redundant spectral bands.

5.3 Model complexity and speed analysis

To comprehensively analyze the complexity of the proposed network, we calculated the number of trainable parameters (Params) and FLOPs as well as training (Trn) and inference (Infer) time for the comparison methods on the IP and UH datasets. Params and FLOPs are indirect measures of computational complexity, while runtime is a direct measure.
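For a single convolutional layer, these indirect measures follow simple closed forms; the sketch below ignores bias and normalization and counts one multiply-accumulate as one FLOP, which is an assumption since counting conventions differ:

```python
def conv2d_cost(h, w, c_in, c_out, k, groups=1):
    """Trainable parameters and FLOPs of one convolutional layer.

    Counts one multiply-accumulate as one FLOP; bias and normalization
    are ignored, so treat this as an illustration of the indirect
    complexity measures, not an exact profiler.
    """
    params = (c_in // groups) * k * k * c_out
    flops = params * h * w  # the kernel is reapplied at every output position
    return params, flops

# A standard 3x3 conv vs. its depthwise counterpart on a 9x9 window
std = conv2d_cost(9, 9, 64, 64, 3)            # (36864, 2985984)
dw = conv2d_cost(9, 9, 64, 64, 3, groups=64)  # (576, 46656)
```

This also makes clear why depthwise separable designs, as used in EfficientNetV2-style blocks, shrink both measures by roughly a factor of the channel count.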

As shown in Table 7, the proposed method generally achieved the best results, especially in training and inference time, and near-best results in Params and FLOPs. SS3FCN, FreeNet, and SSDGL have more Params than the other methods. Although SS3FCN has the fewest FLOPs, its training and inference times are the longest, as it not only employs 3-D networks with many Params but also uses a triple-prediction averaging strategy. Compared to the patch-based methods (i.e. SSRN and DBDA), the FCN-based methods (except SS3FCN) took less time for inference. Note that the time-consuming training process is conducted offline, while the inference speed is the main factor determining whether a method is practical. Thus, the pixel-to-pixel classification strategy is more suitable for practical applications. The proposed network had the fastest inference speed among the compared networks.

Table 7. Comparison of Params, FLOPs, training (abbreviated as Trn), and Inference (abbreviated as Infer) time of different methods on IP and UH datasets.

5.4 Impact of the number of training samples

To test the robustness and stability of the proposed method, we performed experiments with fewer training samples per class on the IP dataset. This dataset is a typical unbalanced dataset with extremely few labeled samples, posing significant challenges to supervised methods. Figure 14 shows the OA of the different methods under varying numbers of training samples, where the training percentage denotes the proportion of the full training set; for example, 100% corresponds to the total number of training samples listed in Figure 5(a). For all methods, accuracy decreased with fewer training samples, especially when the training percentage was < 50%. Nevertheless, the proposed network consistently outperformed the other methods in accuracy, demonstrating its robustness.

Figure 14. The classification accuracy of different methods with a varying number of training samples on the IP dataset.


5.5 Extended experiments

For further performance assessment, we compared the proposed method with MobileNetV2, EfficientNetV2, the spatial-spectral transformer (SST), and the spectral–spatial feature tokenization transformer (SSFTT) on the DFC2018 and Chikusei datasets. MobileNetV2 and EfficientNetV2 are high-efficiency networks that are parameter efficient and fast for image recognition, while SST and SSFTT are transformer-based networks designed for HSI classification. DFC2018 and Chikusei are two large real-world datasets; detailed information can be found in Xu et al. (Citation2019) and Yokoya and Iwasaki (Citation2016), respectively. We adopted the smallest model settings of EfficientNetV2 and MobileNetV2 and reduced their numbers of stages and layers in equal proportion to match those of our EfficientFCN. The parameters of SST and SSFTT were kept the same as in the original papers. According to the dataset and memory size, we set the window size to 32 × 32 for the DFC2018 dataset and 48 × 48 for the Chikusei dataset. The predefined proportion of training windows was set to 8% for both datasets. Again, AA, OA, and Kappa are used for quantitative performance evaluation, and the results are summarized in Table 8 for comparison.

Table 8. Comparison among MobileNetV2, EfficientNetV2, SST, SSFTT and the proposed EfficientFCN on DFC2018 and Chikusei datasets.

Table 8 shows that the proposed method yields the best results on both datasets. EfficientNetV2 and MobileNetV2 are superior to SST and SSFTT, confirming their better generalization. Although SST and SSFTT are transformer-based networks specifically designed for HSI classification, they still follow the patch-based classification framework, in which a large patch size is effective for capturing spatial information for center-pixel classification. However, an excessively large patch size decreases accuracy, mainly because pixels from other classes are included in learning. The proposed EfficientFCN shows better results than EfficientNetV2 and MobileNetV2, demonstrating its superiority and generalizability. Although EfficientFCN can extract information from a larger receptive field by stacking multiple layers, it still lacks global connectivity. Therefore, in the future, we will introduce transformers into FCN-based networks to capture long-range dependencies in both the spatial and spectral dimensions.

6. Conclusion

This study proposes an OEEL framework for HSI datasets to facilitate efficient classification and objective performance evaluation. In this framework, the proposed leakage-free balanced sampling strategy can generate balanced training and test samples without overlapping and information leakage, enabling objective performance evaluation. Based on the generated samples, the EfficientFCN is proposed to avoid redundant computation while exhibiting a favorable accuracy-speed trade-off. Both quantitative and qualitative experimental results show that the proposed EfficientFCN outperforms many state-of-the-art methods.

However, the experimental results in this study may fail to identify suitable DL-based architectures because the lack of HSI datasets prevents some of these architectures from realizing their full potential. Therefore, future work should construct large benchmark datasets to facilitate future research on HSI analysis. Furthermore, we will consider weakly supervised approaches to relieve the demand for expensive pixel-level image annotation.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants 42101321 and 61922013; the China Postdoctoral Science Foundation under Grant 2021M701653; and the Major Special Project of the China High-Resolution Earth Observation System under Grant 30-Y30F06-9003-20/22.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data supporting the findings of this study are all available online. Indian Pines, Pavia University, and Salinas datasets can be downloaded from http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes. The University of Houston dataset can be downloaded from https://hyperspectral.ee.uh.edu/?page_id=459. The DFC2018 dataset can be downloaded from https://hyperspectral.ee.uh.edu/?page_id=1075. The Chikusei dataset can be downloaded from https://naotoyokoya.com/Download.html.

References

  • Alamús, R., S. Bará, J. Corbera, J. Escofet, V. Palà, L. Pipia, and A. Tardà. 2017. “Ground-Based Hyperspectral Analysis of the Urban Nightscape.” ISPRS Journal of Photogrammetry and Remote Sensing 124:16–22. https://doi.org/10.1016/j.isprsjprs.2016.12.004.
  • Bao, R., J. Xia, M. D. Mura, P. Du, J. Chanussot, and J. Ren. 2016. “Combining Morphological Attribute Profiles via an Ensemble Method for Hyperspectral Image Classification.” IEEE Geoscience and Remote Sensing Letters 13 (3): 359–363. https://doi.org/10.1109/LGRS.2015.2513002.
  • Breiman, L. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
  • Cao, X., L. Xu, D. Meng, Q. Zhao, and Z. Xu. 2017. “Integration of 3-Dimensional Discrete Wavelet Transform and Markov Random Field for Hyperspectral Image Classification.” Neurocomputing 226:90–100. https://doi.org/10.1016/j.neucom.2016.11.034.
  • Chen, Y., X. Zhao, and X. Jia. 2015. “Spectral–Spatial Classification of Hyperspectral Data Based on Deep Belief Network.” IEEE Journal of Selected Topics in Applied Earth Observations & Remote Sensing 8 (6): 2381–2392. https://doi.org/10.1109/JSTARS.2015.2388577.
  • Chollet, F. 2017. “Xception: Deep Learning with Depthwise Separable Convolutions.” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 1800–1807. https://doi.org/10.1109/CVPR.2017.195.
  • Cortes, C., and V. Vapnik. 1995. “Support-Vector Networks.” Machine Learning 20 (3): 273–297. https://doi.org/10.1007/BF00994018.
  • Fang, B., Y. Li, H. Zhang, and J. Chan. 2019. “Hyperspectral Images Classification Based on Dense Convolutional Networks with Spectral-Wise Attention Mechanism.” Remote Sensing 11 (2): 159. https://doi.org/10.3390/rs11020159.
  • Fu, J., J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, 2019. Dual Attention Network for Scene Segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 3141–3149.
  • Gao, Q., and S. Lim. 2019. “A Probabilistic Fusion of a Support Vector Machine and a Joint Sparsity Model for Hyperspectral Imagery Classification.” GIScience & Remote Sensing 56 (8): 1129–1147. https://doi.org/10.1080/15481603.2019.1623003.
  • Hao, S., W. Wang, Y. Ye, T. Nie, and L. Bruzzone. 2018. “Two-Stream Deep Architecture for Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 56 (4): 2349–2361. https://doi.org/10.1109/TGRS.2017.2778343.
  • Haut, J. M., M. Paoletti, J. Plaza, and A. Plaza. 2017. “Cloud Implementation of the K-Means Algorithm for Hyperspectral Image Analysis.” The Journal of Supercomputing 73: 514–529.
  • He, X., Y. Chen, and Z. Lin. 2021. “Spatial-Spectral Transformer for Hyperspectral Image Classification.” Remote Sensing 13 (3): 498. https://doi.org/10.3390/rs13030498.
  • Hong, D., Z. Han, J. Yao, L. Gao, B. Zhang, A. Plaza, and J. Chanussot. 2022. “SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers.” IEEE Transactions on Geoscience and Remote Sensing 60:1–15. https://doi.org/10.1109/TGRS.2022.3172371.
  • Hou, Q., M. M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. S. Torr. 2019. “Deeply Supervised Salient Object Detection with Short Connections.” IEEE Transactions on Pattern Analysis & Machine Intelligence 41 (4): 815–828. https://doi.org/10.1109/TPAMI.2018.2815688.
  • Huang, H., F. Luo, J. Liu, and Y. Yang. 2015. “Dimensionality Reduction of Hyperspectral Images Based on Sparse Discriminant Manifold Embedding.” ISPRS Journal of Photogrammetry & Remote Sensing 106:42–54. https://doi.org/10.1016/j.isprsjprs.2015.04.015.
  • Huang, H., C. Pu, Y. Li, and Y. Duan. 2020. “Adaptive Residual Convolutional Neural Network for Hyperspectral Image Classification.” IEEE Journal of Selected Topics in Applied Earth Observations & Remote Sensing 13:2520–2531. https://doi.org/10.1109/JSTARS.2020.2995445.
  • Hu, J., L. Shen, and G. Sun, 2018. Squeeze-And-Excitation Networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 7132–7141.
  • Jiang, J., J. Ma, and X. Liu. 2022. “Multilayer Spectral–Spatial Graphs for Label Noisy Robust Hyperspectral Image Classification.” IEEE Transactions on Neural Networks and Learning Systems 33 (2): 839–852. https://doi.org/10.1109/TNNLS.2020.3029523.
  • Jia, S., L. Shen, J. Zhu, and Q. Li. 2018. “A 3-D Gabor Phase-Based Coding and Matching Framework for Hyperspectral Imagery Classification.” IEEE Transactions on Cybernetics 48 (4): 1176–1188. https://doi.org/10.1109/TCYB.2017.2682846.
  • Lee, H., and H. Kwon. 2017. “Going Deeper with Contextual CNN for Hyperspectral Image Classification.” IEEE Transactions on Image Processing 26 (10): 4843–4855. https://doi.org/10.1109/TIP.2017.2725580.
  • Liang, J., J. Zhou, Y. Qian, L. Wen, X. Bai, and Y. Gao. 2017. “On the Sampling Strategy for Evaluation of Spectral-Spatial Methods in Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 55 (2): 862–880. https://doi.org/10.1109/TGRS.2016.2616489.
  • Lin, M., W. Jing, D. Di, G. Chen, and H. Song. 2022. “Multi-Scale U-Shape MLP for Hyperspectral Image Classification.” IEEE Geoscience & Remote Sensing Letters 19:1–5. https://doi.org/10.1109/LGRS.2022.3141547.
  • Lin, Z., and L. Yan. 2016. “A Support Vector Machine Classifier Based on a New Kernel Function Model for Hyperspectral Data.” GIScience & Remote Sensing 53 (1): 85–101. https://doi.org/10.1080/15481603.2015.1114199.
  • Liu, Z., H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie. 2022. “A ConvNet for the 2020s.“ IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, New Orleans. 11966–11976.
  • Li, Y., H. Zhang, and Q. Shen. 2017. “Spectral–Spatial Classification of Hyperspectral Imagery with 3D Convolutional Neural Network.” Remote Sensing 9 (1): 67. https://doi.org/10.3390/rs9010067.
  • Li, R., S. Zheng, Y. Wang, X. Yang, and X. Wang. 2020. “Classification of Hyperspectral Image Based on Double-Branch Dual-Attention Mechanism Network.” Remote Sensing 12 (3): 582. https://doi.org/10.3390/rs12030582.
  • Loshchilov, I., and F. Hutter. 2019. “Decoupled Weight Decay Regularization.” International Conference on Learning Representations (ICLR), New Orleans, LA, USA.
  • Madasu, A., and A. Rao Vijjini. 2019. “Effectiveness of Self Normalizing Neural Networks for Text Classification.“ Computational Linguistics and Intelligent Text Processing: 20th International Conference 13452: 412–423. La Rochelle, France. arXiv:1905.01338v1. https://doi.org/10.1007/978-3-031-24340-0_31.
  • Ma, N., X. Zhang, H.-T. Zheng, and J. Sun. 2018. “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design.“ European Conference on Computer Vision (ECCV), Munich, Germany, 116–131. https://doi.org/10.1007/978-3-030-01264-9_8.
  • Mok, T. C. W., and A. C. S. Chung. 2020. “Fast Symmetric Diffeomorphic Image Registration with Convolutional Neural Networks.” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 4643–4652.
  • Mou, L., P. Ghamisi, and X. X. Zhu. 2017. “Deep Recurrent Neural Networks for Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 55 (7): 3639–3655. https://doi.org/10.1109/TGRS.2016.2636241.
  • Nalepa, J., M. Myller, and M. Kawulok. 2019. “Validating Hyperspectral Image Segmentation.” IEEE Geoscience & Remote Sensing Letters 16 (8): 1264–1268. https://doi.org/10.1109/LGRS.2019.2895697.
  • Paoletti, M. E., J. M. Haut, J. Plaza, and A. Plaza. 2018. “A New Deep Convolutional Neural Network for Fast Hyperspectral Image Classification.” ISPRS Journal of Photogrammetry & Remote Sensing 145:120–147. https://doi.org/10.1016/j.isprsjprs.2017.11.021.
  • Paul, A., S. Bhattacharya, D. Dutta, J. R. Sharma, and V. K. Dadhwal. 2015. “Band Selection in Hyperspectral Imagery Using Spatial Cluster Mean and Genetic Algorithms.” GIScience & Remote Sensing 52 (6): 643–659. https://doi.org/10.1080/15481603.2015.1075180.
  • Pesaresi, M., A. Gerhardinger, and F. Kayitakire. 2008. “A Robust Built-Up Area Presence Index by Anisotropic Rotation-Invariant Textural Measure.” IEEE Journal of Selected Topics in Applied Earth Observations & Remote Sensing 1 (3): 180–192. https://doi.org/10.1109/JSTARS.2008.2002869.
  • Quan, Y., M. Li, Y. Hao, J. Liu, and B. Wang. 2023. “Tree Species Classification in a Typical Natural Secondary Forest Using UAV-Borne LiDar and Hyperspectral Data.” GIScience & Remote Sensing 60 (1): 2171706. https://doi.org/10.1080/15481603.2023.2171706.
  • Santara, A., K. Mani, P. Hatwar, A. Singh, A. Garg, K. Padia, and P. Mitra. 2017. “BASS Net: Band-Adaptive Spectral-Spatial Feature Learning Neural Network for Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 55 (9): 5293–5301. https://doi.org/10.1109/TGRS.2017.2705073.
  • Shen, Y., S. Zhu, C. Chen, Q. Du, L. Xiao, J. Chen, and D. Pan. 2021. “Efficient Deep Learning of Nonlocal Features for Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 59 (7): 6029–6043. https://doi.org/10.1109/TGRS.2020.3014286.
  • Sothe, C., C. M. De Almeida, M. B. Schimalski, L. E. C. La Rosa, J. D. B. Castro, R. Q. Feitosa, M. Dalponte, et al. 2020. “Comparative Performance of Convolutional Neural Network, Weighted and Conventional Support Vector Machine and Random Forest for Classifying Tree Species Using Hyperspectral and Photogrammetric Data.” GIScience & Remote Sensing 57 (3): 369–394. https://doi.org/10.1080/15481603.2020.1712102.
  • Stuart, M. B., A. J. S. McGonigle, and J. R. Willmott. 2019. “Hyperspectral Imaging in Environmental Monitoring: A Review of Recent Developments and Technological Advances in Compact Field Deployable Systems.” Sensors 19 (14): 3071. https://doi.org/10.3390/s19143071.
  • Sun, L., G. Zhao, Y. Zheng, Z. Wu, Y. Ban, X. Li, and B. Zhang. 2022. “Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 60:1–14. https://doi.org/10.1109/TGRS.2022.3231215.
  • Tang, Y., Y. Lu, and H. Yuan. 2015. “Hyperspectral Image Classification Based on Three-Dimensional Scattering Wavelet Transform.” IEEE Transactions on Geoscience and Remote Sensing 53 (5): 2467–2480. https://doi.org/10.1109/TGRS.2014.2360672.
  • Tan, M., and Q. Le. 2021. “EfficientNetV2: Smaller Models and Faster Training.” Proceedings of the 38th International Conference on Machine Learning 139:10096–10106. New York.
  • Wang, W., S. Dou, Z. Jiang, and L. Sun. 2018. “A Fast Dense Spectral–Spatial Convolution Network Framework for Hyperspectral Images Classification.” Remote Sensing 10 (7): 1068. https://doi.org/10.3390/rs10071068.
  • Wang, Z., H. Hu, L. Zhang, and J.-H. Xue. 2018. “Discriminatively Guided Filtering (DGF) for Hyperspectral Image Classification.” Neurocomputing 275:1981–1987. https://doi.org/10.1016/j.neucom.2017.10.046.
  • Wang, Y., K. Li, L. Xu, Q. Wei, F. Wang, and Y. Chen. 2021. “A Depthwise Separable Fully Convolutional ResNet with ConvCrf for Semisupervised Hyperspectral Image Classification.” IEEE Journal of Selected Topics in Applied Earth Observations & Remote Sensing 14:4621–4632. https://doi.org/10.1109/JSTARS.2021.3073661.
  • Woo, S., J. Park, J.-Y. Lee, and I. S. Kweon. 2018. “CBAM: Convolutional Block Attention Module.” European Conference on Computer Vision (ECCV), Munich, Germany, 3–19.
  • Xu, Y., B. Du, and L. Zhang. 2020. “Beyond the Patchwise Classification: Spectral-Spatial Fully Convolutional Networks for Hyperspectral Image Classification.” IEEE Transactions on Big Data 6 (3): 492–506. https://doi.org/10.1109/TBDATA.2019.2923243.
  • Xu, Y., B. Du, L. Zhang, D. Cerra, M. Pato, E. Carmona, S. Prasad, N. Yokoya, R. Hänsch, and B. L. Saux. 2019. “Advanced Multi-Sensor Optical Remote Sensing for Urban Land Use and Land Cover Classification: Outcome of the 2018 IEEE GRSS Data Fusion Contest.” IEEE Journal of Selected Topics in Applied Earth Observations & Remote Sensing 12 (6): 1709–1724. https://doi.org/10.1109/JSTARS.2019.2911113.
  • Xu, X., W. Li, Q. Ran, Q. Du, L. Gao, and B. Zhang. 2018. “Multisource Remote Sensing Data Classification Based on Convolutional Neural Network.” IEEE Transactions on Geoscience and Remote Sensing 56 (2): 937–949. https://doi.org/10.1109/TGRS.2017.2756851.
  • Xu, Y., L. Zhang, B. Du, and F. Zhang. 2018. “Spectral–Spatial Unified Networks for Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 56:5893–5909. https://doi.org/10.1109/TGRS.2018.2827407.
  • Yokoya, N., and A. Iwasaki. 2016. “Airborne Hyperspectral Data Over Chikusei.” Technical Report SAL-2016-05-27. Space Application Laboratory, University of Tokyo, Japan, May 2016.
  • Yu, S., S. Jia, and C. Xu. 2017. “Convolutional Neural Networks for Hyperspectral Image Classification.” Neurocomputing 219:88–98. https://doi.org/10.1016/j.neucom.2016.09.010.
  • Zhang, X., G. Sun, X. Jia, L. Wu, A. Zhang, J. Ren, H. Fu, and Y. Yao. 2021. “Spectral-Spatial Self-Attention Networks for Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 60:1–15. https://doi.org/10.1109/TGRS.2021.3102143.
  • Zheng, Z., X. Zhang, P. Xiao, and Z. Li. 2021. “Integrating Gate and Attention Modules for High-Resolution Image Semantic Segmentation.” IEEE Journal of Selected Topics in Applied Earth Observations & Remote Sensing 14:4530–4546. https://doi.org/10.1109/JSTARS.2021.3071353.
  • Zheng, Z., Y. Zhong, A. Ma, and L. Zhang. 2020. “FPGA: Fast Patch-Free Global Learning Framework for Fully End-To-End Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 58 (8): 5612–5626. https://doi.org/10.1109/TGRS.2020.2967821.
  • Zhong, Z., J. Li, Z. Luo, and M. Chapman. 2018. “Spectral–Spatial Residual Network for Hyperspectral Image Classification: A 3-D Deep Learning Framework.” IEEE Transactions on Geoscience and Remote Sensing 56 (2): 847–858. https://doi.org/10.1109/TGRS.2017.2755542.
  • Zhu, Q., W. Deng, Z. Zheng, Y. Zhong, Q. Guan, W. Lin, L. Zhang, and D. Li. 2021. “A Spectral-Spatial-Dependent Global Learning Framework for Insufficient and Imbalanced Hyperspectral Image Classification.” IEEE Transactions on Cybernetics, early access. https://doi.org/10.1109/TCYB.2021.3070577.
  • Zhu, M., L. Jiao, F. Liu, S. Yang, and J. Wang. 2020. “Residual Spectral–Spatial Attention Network for Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 59 (1): 449–462. https://doi.org/10.1109/TGRS.2020.2994057.
  • Zou, L., X. Zhu, C. Wu, Y. Liu, and L. Qu. 2020. “Spectral–Spatial Exploration for Hyperspectral Image Classification via the Fusion of Fully Convolutional Networks.” IEEE Journal of Selected Topics in Applied Earth Observations & Remote Sensing 13:659–674. https://doi.org/10.1109/JSTARS.2020.2968179.