Research Article

VBNet: A Visually-Aware Biomimetic Network for Simulating the Human Eye’s Visual System

Article: 2335100 | Received 22 Oct 2023, Accepted 21 Mar 2024, Published online: 01 Apr 2024

ABSTRACT

In the rapidly advancing realms of computer vision and artificial intelligence, the quest for human-like intelligence is escalating. Central to this pursuit is visual perception, with the human eye as a paragon of efficiency in the natural world. Recent research has prominently embraced the emulation of the human eye’s visual system in computer vision. This paper introduces a pioneering approach, the visually-aware biomimetic network (VBNet), composed of a dual-branch parallel architecture: a Transformer branch emulating the peripheral retina for global feature dependencies and a CNN branch resembling the macular region for local details. Furthermore, it employs feature converter modules (CFC and TFC) to enhance information fusion between the branches. Empirical results highlight VBNet’s superiority over RegNet and PVT in ImageNet classification and competitive performance in MSCOCO object detection and instance segmentation. The dual-branch design, akin to the human visual system, enables simultaneous focus on local and global features, offering fresh perspectives for future research in the field of computer vision and artificial intelligence.

Introduction

Convolutional Neural Networks (CNNs) draw inspiration from the biological visual system. This system processes visual information through neurons that classify and organize it into local, spatially related features, such as edges, spots, and contours. Neurons in the visual system have limited visual receptive fields, processing local regions of input images. Similarly, CNNs employ convolutional layers to extract features from local regions. These networks excel in image classification, object detection, and instance segmentation. However, their performance is limited by a lack of global contextual information, much as the fovea of the retina captures fine detail only within a narrow field. One common approach to expanding the receptive field in CNNs is stacking multiple convolutional layers, but this can compromise the network’s perception of local details. Figure 1 illustrates the Effective Receptive Fields (ERFs) (Luo et al. Citation2016) of ResNet-34/50/101 and VBNet.

Figure 1. ERFs comparison between ResNet-34/50/101 and VBNet.

In recent years, Transformers have made significant strides in natural language processing, and Vision Transformers (ViT) (Dosovitskiy et al. Citation2021) have extended this success to computer vision (CV). ViT partitions images into patches, applies a series of multi-layer Transformer blocks to capture relationships between patches, and constructs a global visual feature representation. However, ViT focuses on global features, neglecting local details, and requires large-scale datasets such as JFT-300M for optimal performance.

The human eye’s visual mechanism, illustrated in Figure 2, consists of the fovea and the peripheral retina. The fovea, with a high cone cell density, offers precise sampling, resolution, and grayscale depth. In contrast, the peripheral retina has a lower cone cell density, resulting in relatively blurry peripheral vision. This mechanism allows us to focus on key object details while maintaining a broader field of vision.

Figure 2. Human visual mechanism.

Based on this, we propose a novel model named “Visually-Aware Biomimetic Network” (VBNet), which amalgamates CNN and Transformer strengths, emulating human eye imaging. VBNet comprises stacked VBNet blocks, each with a CNN branch for local details and a Transformer branch for global awareness, arranged in parallel. The CNN branch employs deep convolution modules, using larger kernels and layer normalization to supplement local detail capture. The Transformer branch features self-attention modules and MLP units for global context and peripheral retina simulation. To bridge differences between CNN and Transformer features, we introduce Convolution Feature Converter (CFC) and Transformer Feature Converter (TFC). These modules align the channel dimension with 1 × 1 convolution and match spatial dimensions through down-sampling and up-sampling.

VBNet’s novel structure extracts global features with its global awareness and enhances object detail capture with its local detail capability. Figure 1 illustrates VBNet’s global awareness (light blue) alongside its local detail capture (dark blue and red). VBNet excels with minimal training data, demonstrating the feasibility of bionic vision.

Related Work

Visual Biomimetic CNNs

CNNs have been the cornerstone of computer vision for over two decades. Inspired by human vision, models like VGGNet, ResNet, GoogLeNet, and MobileNet have harnessed the biomimetic potential of convolution as a feature extractor. Recent advances such as SENet (Hu, Shen, and Sun Citation2018) and CBAM (Woo et al. Citation2018) have introduced visual attention mechanisms for enhanced feature representation. Additionally, eRPN (Wang et al. Citation2022) mitigates issues related to predefined anchor points and sample quality. In VBNet, we adopt a dual-branch approach, using CNN to capture local details (fovea) and ViT to establish long-range dependencies (peripheral retina), creating a comprehensive biomimetic neural network.

Vision Transformers

The transformative potential of Vision Transformers (ViT) in computer vision is evident. ViT leverages self-attention mechanisms to process long-distance dependencies by converting images into sequences. However, ViT has limitations in handling spatial transformations and local features due to its fixed-size block approach. To address these limitations, NLFFTNet (Shen and Wang Citation2022) improves local awareness, DeiT (Touvron et al. Citation2021) distills features, and T2TViT (Yuan et al. Citation2021) incorporates local structure. PVT (Wang et al. Citation2021) and CvT (Wu et al. Citation2021) blend self-attention and convolution. Furthermore, Window-based Vision Transformer (Liu et al. Citation2021) uses local window self-attention. In contrast, VBNet integrates the local feature capturing strength of CNN with the global awareness of Transformer in a dual-branch parallel structure.

Methods

The VBNet Block

The VBNet block, as shown in Figure 3, is a deep learning architecture that consists of a CNN branch and a Transformer branch. In VBNet, input features are extracted by the CNN and Transformer branches, respectively, and then fused to enhance the performance of the model. Specifically, the two branches’ outputs are integrated via the following equation:

\[\hat{X} = \mathrm{ADD}\big(\mathrm{CONV}(X),\ \mathrm{TRANS}(X)\big) \tag{1}\]

Figure 3. VBNet block.

where CONV denotes the CNN branch, which uses depth-wise separable convolution to extract local detail features, and TRANS denotes the Transformer branch, which processes the same input. In the Transformer branch, the CFC module converts the previous layer’s feature dimensions to match the ViT format, the ViT module then extracts features from a global perspective, and the TFC module transforms the feature dimensions back to the CNN branch’s format. ADD stands for element-wise addition of the CONV and TRANS outputs. After fusing these features, the model requires regularization to prevent overfitting; here, the DROP function denotes the stochastic-depth (random depth) regularization operation, which randomly skips some network layers during training to increase robustness and generalization and thereby reduce the model’s error on test data. The VBNet block thus combines the advantages of the CNN and Transformer models to extract and fuse local detail features and global characteristics, while the DROP function improves robustness and generalization.
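
As a concrete illustration, the following is a minimal PyTorch sketch of this block. It is not the authors’ implementation: the module names (conv_branch, cfc, vit_block, tfc), the stochastic-depth rate, and the residual shortcut placement are hypothetical placeholders; only the fusion pattern of equation (1) is taken from the text.

```python
import torch
import torch.nn as nn

class VBNetBlockSketch(nn.Module):
    """Fusion pattern of Eq. (1) with a residual shortcut and stochastic depth."""
    def __init__(self, conv_branch, cfc, vit_block, tfc, drop_prob=0.1):
        super().__init__()
        self.conv_branch = conv_branch   # CNN branch: local detail features (CONV)
        self.cfc = cfc                   # CNN-format feature -> token format
        self.vit_block = vit_block       # global self-attention + MLP
        self.tfc = tfc                   # token format -> CNN-format feature
        self.drop_prob = drop_prob       # stochastic-depth rate (the DROP operation)

    def forward(self, x):
        local_feat = self.conv_branch(x)                      # CONV(X)
        global_feat = self.tfc(self.vit_block(self.cfc(x)))   # TRANS(X)
        fused = local_feat + global_feat                      # ADD(CONV(X), TRANS(X))
        if self.training and torch.rand(1).item() < self.drop_prob:
            return x                                          # randomly drop the block
        return x + fused                                      # shortcut connection
```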

CNN Branch

As illustrated in Figure 4, we adopted various convolutional module architectures. In the CNN branch, we use an inverted residual structure similar to MobileNet v2 (Sandler et al. Citation2018). Unlike the residual structure in ResNeXt, the inverted residual structure first performs channel-wise expansion with a 1 × 1 convolution, then extracts features via depth-wise separable convolution, and finally reduces the dimensionality back to the original dimension with another 1 × 1 convolution. Because the activation function is applied in the high-dimensional space, relatively little information is lost. Therefore, the inverted residual structure better preserves the local detail information captured by the network.
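
For illustration, a minimal PyTorch sketch of such an inverted residual block is given below. The expansion ratio of 4 and the GELU activations are assumptions rather than the paper’s exact configuration, and the 5 × 5 depth-wise kernel anticipates the kernel-size choice discussed in the next paragraph.

```python
import torch.nn as nn

def inverted_residual(channels: int, expansion: int = 4, kernel_size: int = 5) -> nn.Sequential:
    """1x1 expand -> depth-wise conv -> 1x1 reduce (residual shortcut omitted for brevity)."""
    hidden = channels * expansion
    return nn.Sequential(
        nn.Conv2d(channels, hidden, kernel_size=1, bias=False),          # channel-wise expansion
        nn.GELU(),
        nn.Conv2d(hidden, hidden, kernel_size,
                  padding=kernel_size // 2, groups=hidden, bias=False),  # depth-wise convolution
        nn.GELU(),
        nn.Conv2d(hidden, channels, kernel_size=1, bias=False),          # reduce to original dim
    )
```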

Figure 4. Various convolutional module architectures.

Figure 4. Various convolutional module architectures.

In convolutional neural networks, it is typical to stack convolutional layers with a 3 × 3 kernel size; stacking multiple 3 × 3 convolution layers can replace a single layer with a larger kernel while reducing the parameter count. However, the actual ERF deviates considerably from its theoretical value. Moreover, with the continuous evolution of computer hardware, larger convolution kernels have become affordable for enlarging the ERF. Hence, we enlarged the kernel of the depth-wise separable convolution from 3 × 3 to 5 × 5 to modestly expand the ERF and help integrate local features with the global semantic information extracted by the Transformer branch.

To enhance the integration between the CNN and Transformer branches and eliminate their exclusivity, we introduced corresponding style offsets for the CNN branch (shown in Figure 4). During feature extraction, we use depth-wise convolution and the same GELU activation function as the ViT model. The GELU activation function can be represented by the following equation:

\[\mathrm{GELU}(x) = x\,P(X \le x) = x\,\Phi(x) = \frac{x}{2}\left[1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right] \tag{2}\]

where erf denotes the error function, Φ denotes the cumulative distribution function of the standard normal distribution, and ϕ denotes its probability density function. The derivative of the GELU activation function can be calculated as follows:

\[\mathrm{GELU}'(x) = \frac{\partial}{\partial x}\mathrm{GELU}(x) = \Phi(x) + x\,\phi(x) \tag{3}\]

where ϕ represents the probability density function of the standard normal distribution.
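
As a quick numerical check of equations (2) and (3), the following sketch evaluates GELU and its derivative with NumPy and SciPy; it is only a verification aid, not part of the model code.

```python
import numpy as np
from scipy.special import erf
from scipy.stats import norm

x = np.array([-1.0, 0.0, 1.0])
gelu = x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))   # Eq. (2): x * Phi(x)
gelu_grad = norm.cdf(x) + x * norm.pdf(x)        # Eq. (3): Phi(x) + x * phi(x)
print(gelu)       # approx [-0.1587  0.      0.8413]
print(gelu_grad)  # approx [-0.0833  0.5     1.0833]
```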

To improve the integration of the CNN and Transformer branches while avoiding their exclusivity, we incorporated Layer Normalization into the CNN branch’s output. Layer Normalization normalizes the CNN branch’s output over the channel dimension of each sample, mitigating gradient vanishing and improving model stability and convergence speed. This approach is distinct from Batch Normalization: Layer Normalization models statistics for each sample independently, ensuring that different input samples to the network remain independent. We opted for Layer Normalization in designing the VBNet block due to its lower computational cost compared to Batch Normalization, making it an effective and practical choice, particularly for datasets with limited samples or substantial variations. Moreover, as Layer Normalization operates at the channel level, it adapts more readily to varying data distributions and feature changes, thereby improving the handling of complex nonlinear features.
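
In PyTorch terms, applying Layer Normalization over the channel dimension of a convolutional feature map can be sketched as below; the tensor shape and channel count are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 96, 56, 56)       # (N, C, H, W) output of the CNN branch
norm = nn.LayerNorm(96)              # nn.LayerNorm normalizes over the last dimension
# Move channels last, normalize each spatial position's channel vector, move channels back.
x = norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
```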

Transformer Branch

To integrate features from the CNN and Transformer branches, we must consider differences in their feature formats. CNN-extracted features are typically a 3D tensor, with dimensions for channel, height, and width. In contrast, Transformer-derived features are commonly represented as a 2D matrix with dimensions for block number and embedding size.

To merge the complementary information from CNN and Transformer models, we introduce two feature conversion units: the Convolution Feature Converter (CFC) and the Transformer Feature Converter (TFC), depicted in Figure 5. These units act as intermediaries, enabling the fusion and joint utilization of both feature types for downstream tasks.

Figure 5. Operation flow of CFC and TFC feature conversion unit.

In the CFC, the CNN branch’s 3D tensor feature is first pooled and then flattened along the height and width dimensions to form a 2D matrix. The channel dimension is then projected to match the Transformer branch’s embedding dimension E, giving the CNN and Transformer features a uniform dimensional format in both the channel and spatial dimensions.

For the TFC, we transform the Transformer branch’s block features into a 3D tensor with the same channel dimension as the CNN features. Subsequently, convolution kernels are employed to adjust the block features’ height and width dimensions to match the CNN features. This uniformity allows seamless fusion of Transformer and CNN features.

This approach successfully integrates CNN and Transformer features, enhancing model performance and accuracy. Furthermore, its versatility makes it applicable to other deep learning models based on CNN and Transformer architectures.
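
A minimal sketch of these conversions is shown below, assuming PyTorch. The channel count, pooling ratio, and the use of average pooling and nearest-neighbour up-sampling are illustrative assumptions; the paper specifies only that 1 × 1 convolutions align the channel dimension and that down-/up-sampling align the spatial dimensions, and the embedding dimension of 144 follows the value given later in the text.

```python
import torch
import torch.nn as nn

N, C, E, H, W = 2, 96, 144, 56, 56
x_cnn = torch.randn(N, C, H, W)                        # CNN-branch feature (N, C, H, W)

# CFC: 3D feature map -> token matrix (N, num_tokens, E)
cfc_proj = nn.Conv2d(C, E, kernel_size=1)              # 1x1 conv aligns channels with E
pool = nn.AvgPool2d(kernel_size=4)                     # down-sample the spatial dimensions
tokens = pool(cfc_proj(x_cnn)).flatten(2).transpose(1, 2)   # (N, 14*14, E)

# TFC: token matrix -> 3D feature map matching the CNN branch
grid = tokens.transpose(1, 2).reshape(N, E, 14, 14)    # tokens back onto a spatial grid
tfc_proj = nn.Conv2d(E, C, kernel_size=1)              # 1x1 conv aligns channels with C
x_back = nn.functional.interpolate(tfc_proj(grid), size=(H, W))  # up-sample to (H, W)
```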

As illustrated in Figure 6, the transformer block is composed of a multi-head self-attention module and a multilayer perceptron. To reduce the number of parameters, we set the embedding dimension to 144 and use the GELU activation function. We also apply layer normalization before both the multi-head self-attention and the MLP. The multi-head self-attention (2-head) is defined as follows:

\[O = \mathrm{Multihead}(Q, K, V) = W_0\,\mathrm{concat}\!\left(V_1\,\mathrm{softmax}\!\left(\frac{K_1^{T} Q_1}{\sqrt{d_m}}\right),\ V_2\,\mathrm{softmax}\!\left(\frac{K_2^{T} Q_2}{\sqrt{d_m}}\right)\right) \tag{4}\]

where W0 is a learnable parameter matrix, V1 and V2 are the value vectors of the first and second heads, respectively, and K1, Q1 and K2, Q2 are the corresponding key and query vectors. Q, K, and V represent the query, key, and value vectors, respectively, obtained after the first linear transformation of the embedding. Moreover, dm denotes the embedding dimension. Figure 6 depicts the process of the transformer block.
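
A compact sketch of this 2-head attention is given below in PyTorch, written in the more common softmax(QK^T/√dm)V orientation, which is equivalent to equation (4) up to transposition; the per-head split and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def two_head_attention(Q, K, V, W0):
    d_m = Q.shape[-1]
    # Split the embedding into two heads along the last dimension.
    (Q1, Q2), (K1, K2), (V1, V2) = Q.chunk(2, -1), K.chunk(2, -1), V.chunk(2, -1)
    head1 = F.softmax(Q1 @ K1.transpose(-2, -1) / d_m ** 0.5, dim=-1) @ V1
    head2 = F.softmax(Q2 @ K2.transpose(-2, -1) / d_m ** 0.5, dim=-1) @ V2
    return torch.cat([head1, head2], dim=-1) @ W0      # W0 mixes the concatenated heads

num_tokens, d_m = 196, 144
Q = K = V = torch.randn(1, num_tokens, d_m)            # after the first linear transformation
W0 = torch.randn(d_m, d_m)
out = two_head_attention(Q, K, V, W0)                  # shape: (1, 196, 144)
```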

Figure 6. The transformer block.

In patch embedding, position embedding becomes unnecessary as CNN’s sliding window convolution serves a function akin to image position coding. This design reduces computation and improves feature layer fusion quality, minimizing mutual interference. Feature fusion results in the CNN’s feature dimension, negating the need for a class token. A classifier utilizing cross-entropy loss supervises the score post-feature fusion.

The Parallel Design

The inspiration for the parallel structure design comes from the human eye’s visual mechanism. When light reaches the retina, the neural fibers responsible for high-definition imaging in the central foveal region and those responsible for motion perception in the peripheral retina simultaneously convert the received light signals into electrical signals and transmit them through the neural network.

Our designed parallel structure operates similarly. Initially, the image is input to both the CNN and Transformer branches for concurrent feature extraction. Subsequently, feature maps from both branches are fused and connected via shortcut connections. The key distinction from previous works (Touvron et al. Citation2021; Wang et al. Citation2021; Wu et al. Citation2021; Zeng et al. Citation2022) is the parallel processing of input information in our structure, in contrast to previous works that integrated convolution in the Transformer or used the Transformer solely for feature fusion. These approaches underutilize the strengths of convolution and Transformer, while our parallel structure leverages the full potential of both.

Network Specification

Overall Architecture

In VBNet, we simulate human eye movements, known as saccades, by using ViT’s attention mechanism. ViT provides global feature extraction for rapid retrieval of relevant information akin to the target. CNN then enhances detailed feature extraction. Our neural network employs a dual-branch parallel structure for global perception and local detail extraction. Stacking multiple VBNet Blocks facilitates concurrent global and local operations, promoting interaction between CNN and the transformer, thereby realizing the intended visual mechanism.

Specifically, based on the VBNet Block, we construct the VBN network architecture, which combines local feature capture and global feature representation, as shown in Figure 7. Unlike ViT, VBNet adopts a hierarchical structure with four stages, with downsampling ratios of {4, 8, 16, 32}. The hierarchical structure is designed to effectively model the human visual mechanism while obtaining high-level semantic information that can be better applied to downstream tasks. Additionally, global average pooling is introduced at the end to map [C × H × W] to a C-dimensional vector representation, thereby maximizing the retention of the fused local and global features. Finally, a projection layer compresses the channels to 1000 for classification. Similar structures can also be found in other networks, such as Swin Transformer.

Figure 7. The overall architecture of VBNet.

The VBNet primarily comprises the Conv Stem, VBNet Blocks, Stride-Conv Blocks, and a classifier. The Conv Stem elevates the channels of the input image from 3 to C. Each stage consists of multiple VBNet Blocks, and the Stride-Conv is a convolution with a 2 × 2 kernel and a stride of 2, combined with layer normalization, with the number of channels doubling after each Stride-Conv. After global average pooling, layer normalization is applied again, followed by a fully connected layer for classification.
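
The hierarchy described above can be sketched as follows in PyTorch. The block counts (2, 2, 6, 2), the base channel count C = 64, the 4 × 4 stride-4 Conv Stem, and the nn.Identity placeholders standing in for VBNet blocks are all illustrative assumptions; only the four-stage layout, the Stride-Conv down-sampling, and the pooling/classifier head follow the description.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """LayerNorm over the channel dim of an (N, C, H, W) tensor."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
    def forward(self, x):
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

def vbnet_skeleton(C=64, blocks=(2, 2, 6, 2), num_classes=1000):
    layers = [nn.Conv2d(3, C, kernel_size=4, stride=4)]           # Conv Stem: /4
    ch = C
    for i, n in enumerate(blocks):
        if i > 0:                                                 # Stride-Conv between stages
            layers += [nn.Conv2d(ch, ch * 2, kernel_size=2, stride=2), LayerNorm2d(ch * 2)]
            ch *= 2
        layers += [nn.Identity() for _ in range(n)]               # placeholder VBNet blocks
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),             # global average pooling
               nn.LayerNorm(ch), nn.Linear(ch, num_classes)]      # LN + classifier
    return nn.Sequential(*layers)

model = vbnet_skeleton()
print(model(torch.randn(1, 3, 224, 224)).shape)                   # torch.Size([1, 1000])
```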

Architecture Variants

In VBNet, a deep neural network architecture, we employ a multi-stage approach where multiple building blocks are stacked together to form each stage. Notably, the number of building blocks is greater in the two central stages. At each stage, the channel dimension is expanded by a factor of two, allowing for a richer representation of the input data. Furthermore, we observe variations in the number of tokens across the three VBNet variants that we have constructed using this configuration. Comprehensive details regarding these variants can be found in Table 1.

Table 1. VBN structural variants.

Experimental Analyses

To evaluate the effectiveness of VBN in image classification, object detection, and instance segmentation tasks, we conducted experiments on the ImageNet-1K (Chollet Citation2017) and COCO2017 (Lin et al. Citation2014) public datasets. The ImageNet-1K dataset consists of a total of 1,281,167 training images and 50,000 validation images across 1000 different classes. The COCO2017 dataset includes 118k images for training and 5k images for validation. To validate the feasibility of the model, we conducted a series of ablation experiments and other tests.

Image Classification Experiment

During the experiments, VBN was trained and tested on the ImageNet-1K dataset. To ensure fairness and comparability with previous works (Touvron et al. Citation2021), VBN adopted data augmentation techniques, including Mixup (Zhang et al. Citation2018), CutMix (Yun et al. Citation2019), Random-Erasing (Zhong et al. Citation2020), and Rand-Augment (Cubuk et al. Citation2020). The input image size was set to 224 × 224, and AdamW optimizer was used with a weight decay value of 0.04. The model was trained for 300 epochs, with a batch size of 1024. The base learning rate was set to 0.01, and the learning rate was adjusted using cosine annealing.
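For reference, the optimizer and schedule described above can be set up in PyTorch roughly as follows; the placeholder model and the per-epoch stepping of the cosine schedule are assumptions, and the data-augmentation pipeline (Mixup, CutMix, Random-Erasing, Rand-Augment) is assumed to be handled by the data loader.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for the VBNet model
epochs, batch_size, base_lr, weight_decay = 300, 1024, 0.01, 0.04

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one training epoch over ImageNet-1K with the augmentations listed above ...
    scheduler.step()  # cosine-annealed learning rate, stepped once per epoch
```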

Table 2 compares VBN’s Top-1 classification accuracy with several convolutional neural networks and different ViT architectures on the ImageNet dataset. The results indicate that VBN outperforms ResNet, RegNetY (Radosavovic et al. Citation2020), and attention-based networks such as SENet and CBAM. For instance, VBN-B1’s Top-1 accuracy is 6.5% higher than that of ResNet18 + CBAM, at only 67% of its computing cost. This indicates that VBNet’s global representation capability offers a significant advantage in image classification tasks. Furthermore, when comparing with different variants of ViT, we found that VBN-B2 reaches similar Top-1 accuracy at a lower computational cost (only 50% of DeiT-S and 58% of RegNetY-4G), highlighting its ability to capture fine details and revealing the importance of local responses.

Table 2. Comparison of VBN with different CNNs and ViTs on ImageNet.

Ablation Studies

In this section, we conduct ablation experiments to assess VBNet-B1’s effectiveness and report classification accuracy changes for different variants. Ablation experiments are a common practice in deep learning, aiding in the systematic evaluation of model components. Past studies often employed ImageNet-1k for such experiments, a resource-intensive and time-consuming dataset. However, for ablation studies, comparing validation set results among variants suffices without transferring trained weights to downstream tasks. Hence, we advocate the use of Mini-ImageNet for rapid assessment. Mini-ImageNet draws from ImageNet, comprising 60,000 images across 100 classes, further divided into 48,000 training images and 12,000 validation images.

Kernel Sizes in Depth-Wise Convolution

In CNNs, kernel size choice significantly impacts feature extraction from input data. We investigate the effect of various kernel sizes in depth-wise convolution on VBNet’s overall performance.

Table 3 showcases experimental results with different kernel sizes on the Mini-ImageNet dataset. Notably, a 3 × 3 kernel yields 79.6% accuracy, while a 5 × 5 kernel achieves 80.2%. This underscores the importance of interplay between the separate CNN and Transformer branches for effective feature fusion.

Table 3. Comparison of accuracy of different kernel sizes in depth-wise convolution.

Increasing kernel size in depth-wise convolution gradually expands the receptive field in a single block, enhancing integration with the transformer branch. However, a 13 × 13 kernel results in a 1.0% accuracy drop compared to 5 × 5. Thus, we opt for a 5 × 5 window size in VBNet, offering a balance between global contextual information and local detail capture.

Table 3 also details parameter count, FLOPs, and accuracy rate for the different kernel sizes in depth-wise convolution. Notably, the 5 × 5 kernel attains 80.2% accuracy with just 6.57 M parameters and 1.18 G FLOPs.

Block Modifications

In convolutional neural networks, a straight block architecture is often used, while the residual bottleneck structure in ResNeXt provides another way to construct a block. In this paper, we instead adopt the inverted bottleneck structure, similar to MobileNetV2. We evaluate the resulting VBNet architecture on a classification task and present the experimental results in Table 4. For fair comparison, we adjust the channel dimensions to ensure similar FLOPs. Our results demonstrate that using the inverted bottleneck, which expands the dimension before depth-wise convolution, improves accuracy by up to 8.2% relative to a straight block architecture.

Table 4. Comparison of accuracy of different block structures on a benchmark dataset.

Embed Dimension

The embedding dimension in the Transformer branch is a hyperparameter that requires experimental confirmation. In this paper, we compared the performance of the Transformer branch under different embedding dimensions and found significant differences in the required network parameters and computational complexity, as shown in Table 5. In particular, when the embedding dimension increased from 192 to 384, the network parameters and computational complexity increased by more than three times, demanding considerably more capable hardware. Moreover, higher embedding dimensions contributed little to accuracy, with only a 0.1% improvement at an embedding dimension of 384. This is because a higher embedding dimension causes the Transformer branch to occupy more of the network’s parameters, biasing the network toward global perception at the expense of local information. To balance the two, this study selected an embedding dimension of 144 as the more cost-effective choice for constructing VBNet.

Table 5. Impact of embedding dimension on VBNet’s performance.

Comparison of VBNet with Ensemble Models

To ensure fairness, we built other networks with the same parallel structure as VBNet and adjusted the channel dimensions of each stage to keep the overall computing costs equal. The experimental results are presented in Table 6. Compared to these combined networks, VBNet achieved 8.5% higher accuracy, indicating that its two branches effectively extract features both locally and globally.

Table 6. Comparison of VBNet with ensemble models.

Stages Feature Analysis

To assess the effectiveness of our proposed VBNet architecture, we performed feature analysis using Grad-CAM visualization (Zhou et al. Citation2016) on ResNet-50, ViT-B, and VBNet-B2. Figure 8 shows the feature output of each network at different stages.

Figure 8. Grad-CAM visualization of ResNet-50, ViT-B, and VBNet-B2 at different stages.

In the first two stages of the three considered networks, VBNet-B2 and ViT-B effectively extract required semantic information globally, enabling subsequent stages to capture details more accurately. In contrast, ResNet-50 lacks global representation in the first two stages. In the last stage, VBNet-B2, with its ability to capture details, focuses more on local information than ViT-B, enabling it to extract target features more accurately for downstream tasks.

Furthermore, in stage 4 of ResNet-50 and VBNet-B2, VBNet-B2 is more focused on the target, similar to the area of the fovea. This observation validates the effectiveness of the visual mechanism of VBNet, which is designed to extract and represent target features efficiently.

Our feature analysis results demonstrate that VBNet effectively captures both local and global features throughout the network, enabling it to achieve superior performance over other networks with similar architectures.

Object Detection and Instance Segmentation Experiment

Experimental Design

To verify the effectiveness of VBNet in downstream tasks, we used the COCO 2017 dataset and compared VBNet with previous methods, adopting Mask R-CNN for object detection and instance segmentation. In the experiment, we used backbone networks pre-trained on the ImageNet dataset and trained for 12 epochs with the AdamW optimizer. During training, the batch size was set to 32, the initial learning rate was 0.0002, the learning rate was decayed by one order of magnitude at the 8th and 11th epochs, the weight decay was 0.0001, and the dropout rate was 0.1.
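
A sketch of this fine-tuning schedule in PyTorch is given below; the placeholder detector module stands in for Mask R-CNN with a VBNet backbone, and per-epoch stepping of the scheduler is an assumption.

```python
import torch
import torch.nn as nn

detector = nn.Linear(10, 10)  # placeholder for Mask R-CNN with a VBNet backbone
optimizer = torch.optim.AdamW(detector.parameters(), lr=2e-4, weight_decay=1e-4)
# Drop the learning rate by one order of magnitude at epochs 8 and 11 of the 12-epoch schedule.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

for epoch in range(12):
    # ... one training epoch on COCO 2017 ...
    scheduler.step()
```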

Experimental Results

The comparison results for object detection and instance segmentation on the COCO 2017 dataset are shown in Table 7. The table shows that the three VBNet models we constructed as backbones for Mask R-CNN consistently outperformed other methods. It is worth noting that VBNet-B2 scores 7.6 box mAP higher than ResNet18 with the same number of parameters (31 M), a remarkable result. In addition, VBNet exhibits the same efficiency in object detection and instance segmentation as it does in image classification, achieving higher performance with fewer computing resources. Especially noteworthy, our VBNet-B3 (42 M) surpassed Swin-T (48 M) by 0.9 box mAP and 0.8 mask mAP in object detection and instance segmentation. These experimental results fully demonstrate that VBNet, with its visual mechanisms, delivers outstanding performance in downstream tasks.

Table 7. Object detection and instance segmentation comparison on MSCOCO dataset.

Conclusion and Discussion

In this paper, we successfully constructed a visually-aware biomimetic network (VBNet) by integrating Transformer and CNN branches. Our experimental results demonstrate that, under similar parameter and computational constraints, VBNet, with human-like visual mechanisms, significantly outperforms traditional CNN and Transformer models, and has enormous potential as a backbone for downstream tasks. However, it should be noted that our offsets in the CNN branch to accommodate ViT may not be the best approach for integrating CNN with Transformer. Therefore, exploring better integration methods is an interesting and important problem for future research. Moreover, the global self-attention mechanism of the Transformer is computationally expensive and occupies many parameters in the network. While applying local-window self-attention or shift-windows self-attention may potentially reduce computation costs, neither of these methods can fully leverage the Transformer’s ability to model global dependencies. Thus, striking a balance between the complexity of the Transformer and the representation of global characteristics is a topic that requires further investigation.

Additionally, although our proposed system is based on human visual mechanisms, it still has limitations as it relies on deep learning. Future research should aim to break down the barriers of deep learning by cross-integrating biology, mathematics, and computer science to develop a strong artificial intelligence vision mechanism. Furthermore, it is important to remain open to alternative approaches beyond biomimetics, as the invention of aircraft evolved from the bionic study of bird flight to the field of aerodynamics.

Acknowledgement

This research was supported by the Zigong Science and Technology Program of China (Grant No. 2019YYJC15), the Nature Science Foundation of Sichuan University of Science & Engineering (Grant No. 2020RC32), the Key Laboratory of Higher Education of Sichuan Province for Enterprise Informationalization and Internet of Things (Grant No. 2022WZJ02), the Graduate Innovation Fund of Sichuan University of Science & Engineering (Grant No. Y2022143), and the National Training Programs of Innovation and Entrepreneurship for Undergraduates of China (Grant No. S202210622097). The authors are deeply grateful for this support.

Data availability statement

The data that support the findings of this study are the COCO dataset (Lin et al. Citation2014) and the ImageNet dataset, both of which are available in the public domain.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The work was supported by the Key Laboratory of Higher Education of Sichuan Province for Enterprise Informationalization and Internet of Things [No.2022WZJ02]; Nature Science Foundation of Sichuan University of Science & Engineering [No.2020RC32]; National Training Programs of Innovation and Entrepreneurship for Undergraduates of China [No.S202210622097]; Graduate Innovation Fund of Sichuan University of Science & Engineering [No.Y2022143]; Zigong Science and Technology Program of China [No.2019YYJC15].

References

  • Chollet, F. 2017. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 1251–18.
  • Cubuk, E. D., B. Zoph, J. Shlens, and Q. V. Le. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 702–03.
  • Dosovitskiy, A., L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale (arXiv:2010.11929). arXiv doi:10.48550/arXiv.2010.11929.
  • He, K., X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 770–78.
  • Hu, J., L. Shen, and G. Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 7132–41.
  • Lin, T.-Y., M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer vision – ECCV 2014, ed. D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, 740–55. Zurich, Switzerland: Springer International Publishing.
  • Liu, Z., Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo 2021. Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10012–22.
  • Luo, W., Y. Li, R. Urtasun, and R. Zemel. 2016. Understanding the effective receptive field in deep convolutional neural networks. In Advances in neural information processing systems, ed. D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, vol. 29. Curran Associates, Inc.
  • Radosavovic, I., R. P. Kosaraju, R. Girshick, K. He, and P. Dollar. 2020. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 10428–36.
  • Sandler, M., A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 4510–20.
  • Shen, L., and Y. Wang. 2022. TCCT: Tightly-coupled convolutional transformer on time series forecasting. Neurocomputing 480:131–45. doi:10.1016/j.neucom.2022.01.039.
  • Touvron, H., M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jegou. 2021. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, 10347–57.
  • Wang, W., E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao. 2021. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 568–78.
  • Wang, Q., S. Zhang, Y. Qian, G. Zhang, and H. Wang. 2022. Enhancing representation learning by exploiting effective receptive fields for object detection. Neurocomputing 481:22–32. doi:10.1016/j.neucom.2022.01.020.
  • Woo, S., J. Park, J.-Y. Lee, and I. S. Kweon. 2018. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 3–19.
  • Wu, H., B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang. 2021. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 22–31.
  • Yuan, L., Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. H. Tay, J. Feng, and S. Yan. 2021. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, 558–67.
  • Yu, W., M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan. 2022. MetaFormer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, USA, 10819–29.
  • Yun, S., D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. 2019. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, 6023–32.
  • Zeng, K., Q. Ma, J. Wu, S. Xiang, T. Shen, and L. Zhang. 2022. Nlfftnet: A non-local feature fusion transformer network for multi-scale object detection. Neurocomputing 493:15–27. doi:10.1016/j.neucom.2022.04.062.
  • Zhang, H., M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. 2018. mixup: Beyond Empirical Risk Minimization. arXiv:1710.09412. arXiv.
  • Zhong, Z., L. Zheng, G. Kang, S. Li, and Y. Yang. 2020. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York City, USA, 34 (07), Article 07.
  • Zhou, B., A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, 2921–29.