Research Article

Urban object detection algorithm based on feature enhancement and progressive dynamic aggregation strategy

Article: 2322061 | Received 30 Oct 2023, Accepted 16 Feb 2024, Published online: 15 Mar 2024

Abstract

Traditional target detection models face challenges in recognizing urban high-altitude remote sensing targets due to complex background noise and significant variations in target scale. These challenges can result in loss of feature information and missed object detection. In light of this, this article introduces a novel dual-gated feature mechanism and adaptive fusion strategy. First, the dual-gated feature mechanism enables selective suppression or enhancement of multilevel features, thereby reducing the interference of complex environmental noise in remote sensing on feature fusion. Second, the adaptive fusion strategy and module facilitate multilevel scale feature fusion, and by dynamically learning fusion weights, they mitigate scale conflicts during the feature extraction process and preserve feature information. Experimental comparisons and analysis on the RSOD and NWPU VHR-10 public datasets showcase the effectiveness of the proposed method. In comparison to current mainstream detection methods, the improved approach presented in this article demonstrates significant advantages in terms of detection performance and efficiency.

1. Introduction

Recently, high-altitude remote sensing technology, with its high environmental adaptability, has found widespread application in various fields, including natural resource management, environmental monitoring and urban planning (Zhang et al. Citation2018). Acquiring high-resolution remote sensing images makes it feasible to monitor and analyse urban expansion, providing data support for urban planning. It can also assist city administrators in urban construction and management by classifying and recognizing urban targets (Ahern Citation2011).

Early remote sensing detection methods were based on Synthetic Aperture Radar (SAR) and Geographic Information System (GIS) techniques. SAR images, which can be acquired under any weather conditions and are not affected by disturbances such as clouds and fog, were frequently used (Maguire Citation1991). However, SAR image processing is limited by the number of available sensors, long revisit periods and relatively low resolution. GIS technology emerged in the late twentieth century, facilitating the development of remote sensing digital image processing and advancing remote sensing data acquisition and processing techniques; one of its key achievements was the development of multisource remote sensing data extraction and fusion techniques. Nevertheless, the acquisition and processing costs of GIS remote sensing data tend to be high, massive data support is required, and GIS technology standards vary across fields (Ceccato and Snickars Citation2000). Over the past few years, deep learning has emerged as a crucial technology for remote sensing detection. By training on limited data and automatically extracting complex features, deep learning has yielded substantial performance advantages in detecting and recognizing remote sensing images, and its application provides more opportunities for remote sensing detection.

In recent years, convolutional neural networks (CNNs) have demonstrated remarkable performance in the field of deep learning, specifically in areas such as image classification, semantic segmentation and object detection (Chouhan et al. Citation2019). From a detection perspective, deep learning object detection algorithms can be categorized into two types. The first type is the two-stage method, which first determines the number and positions of candidate bounding boxes and then performs precise classification on these candidates using a classifier to obtain the object detection outcomes (Girshick Citation2015). Well-known algorithms in this category include Regions with Convolutional Neural Network features (RCNN), Fast RCNN and Faster RCNN. While two-stage algorithms generally exhibit superior accuracy, they are often hindered by limited detection speed because candidate bounding boxes must be generated in advance, making real-time implementation challenging (Sun et al. Citation2018). One-stage algorithms integrate candidate bounding box prediction, classification and localization into a unified stage and directly estimate the position and category of the target. Compared to two-stage algorithms, one-stage algorithms offer faster detection speed and real-time capability (Krizhevsky et al. Citation2017). Representative algorithms in this category include the SSD (Single Shot MultiBox Detector) series, RetinaNet and the YOLO (You Only Look Once) series (Reis et al. Citation2023). The advancement of deep learning has produced numerous deep learning-based object detection algorithms in computer vision. Nevertheless, challenges of subpar performance and missed detections in high-altitude object detection still persist (Li et al. Citation2020). Thus, achieving accurate object detection in complex backgrounds has become a critical concern in current research on remote sensing image object detection.

2. Related work

Nowadays, R-CNN, YOLO and other series of networks have proven effective for natural image target detection, so researchers have applied these networks to optical remote sensing image target detection. In 2012, Krizhevsky et al. introduced AlexNet, which achieved the highest image classification accuracy among convolutional neural network models in a computer vision competition (Krizhevsky et al. Citation2017). In 2015, He et al. introduced residual networks (ResNets) as a solution to the vanishing gradient problem (He et al. Citation2016); this method passes the input between layers and adds it to the result of the convolution. In 2013, Zhang et al. proposed a detection algorithm using an adaptive spatial subsampling visual attention model, introducing a completely new approach to whole-image segmentation and feature detection (Zhang et al. Citation2013). In 2020, Tan et al. introduced a compound scaling method along with a weighted bidirectional feature pyramid network (BiFPN) to effectively scale the size of the network (Tan et al. Citation2020). In 2021, Del Prete et al. presented a RetinaNet-based deep learning network for target detection, specifically designed to effectively detect ships on the sea surface (Del Prete et al. Citation2021). In 2022, Li et al. presented a multiscale feature extraction network specifically designed for water extraction, which facilitates the localization and distribution analysis of water bodies in the ecological environment field (Li et al. Citation2022). In 2022, Wang et al. introduced the 'extend' and 'compound scaling' methods in YOLOv7, surpassing all previous models in one-stage target detection accuracy (Wang et al. Citation2022). However, high-altitude remote sensing encounters distinct challenges from urban background interference (Hazaymeh et al. Citation2022). Additionally, urban objects often undergo significant scale changes and possess low texture features. These difficulties introduce new obstacles to the object detection task (Rottensteiner et al. Citation2014). To address these issues, this study presents a novel enhancement method based on the YOLOv7 model. The main contributions of this article are as follows:

  1. Aiming at the feature loss problem, this article designs a dual-attention gating mechanism. Through adaptive weight selection and channel-wise weighting of feature maps, the representation of small targets in remote sensing images is effectively enhanced and extracted. Feature information is better preserved and utilized under complex background interference, and recognition accuracy is improved.

  2. To address the problem of scale variation in remote sensing images, this article proposes an adaptive fusion strategy and designs an adaptive fusion module. The dynamic aggregation strategy can adapt to different sizes and proportions of targets, allowing for gradual aggregation of features at different scales and better capturing detailed information of targets of various sizes. This reduces information loss and improves the recognition accuracy of detected targets.

  3. Through experiments conducted on the RSOD and NWPU VHR-10 public datasets, this article demonstrates significant improvement in remote sensing object detection tasks. Compared to traditional methods, the proposed algorithm achieves excellent results in terms of detection accuracy, recall rate and robustness, making it highly efficient for practical applications in remote sensing object detection tasks.

The organization of the rest of this article is as follows. Section 2 reviews related work. Section 3 provides a detailed description of the proposed improvement method. To evaluate our approach, experiments are designed in Section 4, and the proposed method is discussed. Finally, Section 5 summarizes the work presented in this article.

3. Method

As presented in Figure 1, this article proposes a model that uses YOLOv7 as the base model to investigate the inadequate fusion of multitarget features in remote sensing images (Wang et al. Citation2022). This research aims to mitigate missed detections and false detections during the detection process. In this section, we first introduce an adaptive fusion strategy for multiple feature layers: fusion weights are learned from the multiscale input objects, distinct fusion weights are assigned to multiscale features, and dynamic fusion of these features is achieved. Additionally, a gating feature mechanism enhances and filters the fused feature map, selectively strengthening or suppressing adjusted multi-object detail features and preventing the submergence of multi-object features in conflicting information (Zhao et al. Citation2019).

Figure 1. The framework of our model.


3.1. Adaptive fusion strategy

During the training of deep learning models, high-resolution remote sensing image features encompass various levels of information, including low-level detail features and high-level semantic features (Ioffe and Szegedy Citation2015). While the FPN structure in the target detection model allows for the fusion of different scale features, it is important to note that features at different levels contribute differently to the task at hand. Simply fusing these features directly would introduce redundant and conflicting information, ultimately diminishing the model’s ability to express multiscale characteristics and compromising detection accuracy (Soudy et al. Citation2022). To address this, our article proposes an adaptive feature fusion strategy that enables the model to comprehend the significance of feature information at all levels. By employing this adaptive fusion approach, features at different levels can be gradually fused, with the fusion weights flexibly adjusted based on the specific input scenario (Liu et al. Citation2020). This framework enhances the model’s adaptability for feature fusion, thereby improving its overall performance.

The proposed adaptive fusion module utilizes a horizontal contraction approach to aggregate cross-scale information. This method ensures consistency of different scale features for the same target, uncovers important information across scales and suppresses noise. By accurately expressing the characteristics of the input data, the adaptive fusion structure enhances the model's robustness and generalization ability. A single adaptive feature fusion module is depicted in Figure 2.

Figure 2. Adaptive feature fusion module.


In traditional target detection, the use of a feature fusion layer with fixed weights means that all input images share the same fusion mode (He et al. Citation2015). However, this approach overlooks the impact of the target itself on feature fusion, limiting detection performance. To address this limitation, fusion weights should be adjusted adaptively according to the scale of the input object, enhancing the adaptability of feature fusion. Therefore, this article proposes an adaptive fusion module that incorporates these ideas: three adaptive fusion modules are aggregated into a cross-level feature structure to ensure precise target detection. The module is implemented primarily through a fusion weight learning approach, in which two sets of feature data with differing amounts of information but the same size are input, and a set of feature fusion weights is output. Figure 3 illustrates the adaptive feature fusion structure devised in this article alongside the convolutional feature fusion structure of YOLOv7.

Figure 3. Adaptive feature fusion structure and convolutional feature fusion structure.


In the adaptive feature fusion structure proposed in this article, two same-sized features \(V_1, V_2 \in \mathbb{R}^{C \times H \times W}\) are first concatenated to obtain the feature \(V_3 \in \mathbb{R}^{2C \times H \times W}\). Global average pooling is then applied to \(V_3\) to acquire the feature vector \(S \in \mathbb{R}^{2C}\). Global average pooling is defined as follows:
(1) \( S(x) = \frac{1}{H \times W} \sum_{i=1}^{W} \sum_{j=1}^{H} O_x(i, j) \)

In this context, \(x \in [1, C]\) indexes the channels, \(O_x\) denotes the \(x\)-th channel of the input feature and \(S(x)\) represents the value of the \(x\)-th channel after global average pooling. \(S\) serves as the input for two fully connected layers, which learn the feature fusion weight \(Z\). To enhance the training stability of the fused weight learner, a softmax operation is applied to the weight relationship matrix \(Z\), which yields the normalized fused weights (Huang et al. Citation2018). The normalized weights are then linearly fused with \(V_3\) using Eq. (2):
(2) \( F = \sum_{i=1}^{2} w_i P_i \)
where \(w_i\) denotes the learned fusion weight and \(P_i\) the corresponding input feature.

Finally, the fused feature \(F \in \mathbb{R}^{C \times H \times W}\) is obtained by feeding two sets of same-sized features into the fused weight learner. This process is repeated across three fused weight learners in sequence, forming the adaptive feature fusion structure.
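The fused weight learner can be summarised in a few lines. Below is a minimal PyTorch sketch of a single adaptive fusion module following Eqs. (1)-(2); the bottleneck ratio of the two fully connected layers and the class name are illustrative assumptions, not details specified in this article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveFusion(nn.Module):
    """Fuses two same-sized feature maps with weights learned from their
    global statistics (Eqs. (1)-(2)). Layer sizes are illustrative."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Two fully connected layers mapping the pooled 2C-dim vector
        # to two scalar fusion weights (one per input branch).
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2),
        )

    def forward(self, v1: torch.Tensor, v2: torch.Tensor) -> torch.Tensor:
        # Concatenate V1 and V2 along the channel axis: (B, 2C, H, W).
        v3 = torch.cat([v1, v2], dim=1)
        # Global average pooling, Eq. (1): (B, 2C).
        s = F.adaptive_avg_pool2d(v3, 1).flatten(1)
        # Learn and normalise the fusion weights with a softmax.
        w = torch.softmax(self.fc(s), dim=1)           # (B, 2)
        w1, w2 = w[:, 0], w[:, 1]
        # Weighted fusion, Eq. (2): F = w1 * V1 + w2 * V2.
        return w1.view(-1, 1, 1, 1) * v1 + w2.view(-1, 1, 1, 1) * v2


# Example: fuse two 256-channel feature maps of the same spatial size.
if __name__ == "__main__":
    fuse = AdaptiveFusion(channels=256)
    a, b = torch.randn(2, 256, 40, 40), torch.randn(2, 256, 40, 40)
    print(fuse(a, b).shape)  # torch.Size([2, 256, 40, 40])
```

Chaining three such modules across adjacent feature levels would give the cross-level adaptive fusion structure sketched in Figure 3.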

3.2. Dual gate feature mechanism

In urban remote sensing image target detection, feature loss is a prominent issue. The presence of complex backgrounds and multiscale targets often results in redundant and conflicting information within the fused feature layer (Gomez-Chova et al. Citation2015). This, in turn, diminishes the model's ability to express multiscale features and hampers the accuracy of detection results. To address these challenges, this study proposes a dual-attention gating mechanism that effectively filters out correlation conflict information affected by both inter-class similarity and intra-class diversity. This module generates adaptive weights in both the spatial and channel dimensions, effectively mitigating the noise caused by local information similarity. Moreover, it enhances intra-class correlation to prevent crucial target features from being submerged in conflicting information. The resizing aspect of the dual gating mechanism is accomplished primarily through interpolation and convolution, while the gating unit enables the selective enhancement or suppression of features.

The dual gating mechanism module, as illustrated in Figure 4, combines channel attention, spatial attention, residual connections and transpose multiplication operations. Channel attention establishes correlations between feature channels without introducing additional learning parameters and adaptively enhances or suppresses channel features based on their correlations. The transpose operation captures relationships and structures among features, facilitating similarity calculations and cross-correlation analyses. Spatial attention emphasizes target location information and learns more effective feature regions. Residual connections retain intricate details and vital features from the input data through feature reuse. The gating unit proposed in this article does not require any threshold setting; it is driven entirely by the input features.

Figure 4. Double gate function mechanism module.


In the process of channel attention, the input feature map \(F \in \mathbb{R}^{C \times H \times W}\) is first passed through a maximum pooling layer and an average pooling layer, respectively, to obtain the refined global features \(a, b \in \mathbb{R}^{C \times 1 \times 1}\). The average pooling and maximum pooling layers are calculated as shown in Eqs. (3) and (4), respectively:
(3) \( S_{ij} = \frac{1}{c^2} \left( \sum_{i=1}^{c} \sum_{j=1}^{c} F_{ij} \right) + b_2 \)
(4) \( S_{ij} = \max_{1 \le i, j \le c} (F_{ij}) + b_2 \)

In Eqs. (3) and (4), \(F_{ij}\) represents the value at the \(i\)-th row and \(j\)-th column of the feature map, \(b_2\) represents the bias and \(c\) represents the pooling window size. The global features \(a\) and \(b\) are then separately fed into a shared two-layer fully connected network: the first layer has \(C/16\) neurons and uses the Rectified Linear Unit (ReLU) activation function, while the second layer has \(C\) neurons. The outputs of the fully connected layers are summed element-wise to obtain the feature map \(c\), as shown in Eq. (5):
(5) \( c = f_{C}\big(f_{C/16}(u)\big) \)
where \(u\) denotes a pooled global feature and \(f_{C/16}\), \(f_{C}\) denote the two fully connected layers.

Afterwards, the final channel attention weight vector \(d\) is generated through a sigmoid activation:
(6) \( d = \sigma\big(\mathrm{avgpool}(F_1) + \mathrm{maxpool}(F_1)\big) \)
where \(\sigma\) denotes the sigmoid function.

The channel attention weight vector is multiplied with the input feature map F1 to produce a weighted feature map. This process allows channel attention to adaptively learn the importance of each channel. By weighting the feature map, it can emphasize the feature expression of important channels, resulting in improved model performance and expression ability (Bochkovskiy et al. Citation2020).
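For illustration, a minimal PyTorch sketch of the channel-attention branch described by Eqs. (3)-(6) is given below. It follows the common shared-MLP design with a C/16 bottleneck; the module name and exact layer arrangement are assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn


class ChannelGate(nn.Module):
    """Channel-attention branch of the dual gate (Eqs. (3)-(6)); a minimal
    sketch assuming a shared two-layer MLP with a C/16 bottleneck."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # average pooling, Eq. (3)
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # maximum pooling, Eq. (4)
        # Shared two-layer fully connected network: C -> C/16 -> C.
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        a = self.mlp(self.avg_pool(x))            # global feature a
        m = self.mlp(self.max_pool(x))            # global feature b
        # Element-wise sum (Eq. (5)) followed by sigmoid (Eq. (6)).
        d = torch.sigmoid(a + m).view(b, c, 1, 1)
        # Re-weight the input channels with the attention vector d.
        return x * d
```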

During the transpose multiplication operation, the input feature \(F_1 \in \mathbb{R}^{C \times H \times W}\) undergoes a dimension transformation \(R\) to obtain \(e \in \mathbb{R}^{C \times HW}\) and a dimension transformation with transpose \(R^{T}\) to obtain the feature \(f \in \mathbb{R}^{HW \times C}\). These transformations can be expressed as in Eq. (7):
(7) \( R(e) = \mathrm{transpose}(D_n, \ldots, D_2, D_1) \)

Here, \(D_n\) represents a unidirectional dimension feature matrix. The relationship matrix \(i \in \mathbb{R}^{C \times C}\) between the feature channels is obtained by multiplying the feature matrices \(e\) and \(f\). The normalized relationship matrix \(j \in \mathbb{R}^{C \times C}\) is obtained by applying a softmax operation to \(i\), as shown in Eq. (8):
(8) \( \mathrm{SoftMax}(z_{ij}) = \frac{\exp(z_{ij})}{\sum_{j=1}^{C} \exp(z_{ij})} \)

Here, \(z_{ij}\) represents the value at row \(i\) and column \(j\) of the relationship matrix, indicating the importance of cross-scale features in fusion. The transpose multiplication operation enables the model to capture important features at different spatial positions and to adjust attention weights accordingly, improving its flexibility and accuracy when dealing with scale changes.
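A hedged sketch of the transpose multiplication step is shown below. It builds the C × C channel relationship matrix of Eqs. (7)-(8); how the normalised matrix is mapped back onto the features is not spelled out in the article, so the final re-projection here is an assumption.

```python
import torch
import torch.nn as nn


class ChannelRelation(nn.Module):
    """Transpose-multiplication branch (Eqs. (7)-(8)); a sketch of how the
    C x C channel relationship matrix can be built and normalised."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        e = x.view(b, c, h * w)              # R: flatten to (B, C, HW)
        f = e.transpose(1, 2)                # R^T: (B, HW, C)
        i = torch.bmm(e, f)                  # relationship matrix: (B, C, C)
        j = torch.softmax(i, dim=-1)         # normalised relations, Eq. (8)
        # Assumed re-projection of the relations onto the flattened features,
        # followed by restoring the original spatial shape.
        out = torch.bmm(j, e).view(b, c, h, w)
        return out
```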

In the process of spatial attention, the input feature \(F \in \mathbb{R}^{C \times H \times W}\) passes through global average pooling and maximum pooling layers separately, and the maps \(m\) and \(n\) are calculated using Eqs. (3) and (4). The resulting feature maps have dimensions \(H \times W \times 1\). These maps are concatenated to obtain the fused feature \(o\), which is convolved with a 7 × 7 kernel producing a single-channel output and then passed through the sigmoid function to map its values to [0, 1]. The final attention-weighted feature map is obtained by multiplying the input feature \(F\) with the spatial weight map \(p \in \mathbb{R}^{H \times W \times 1}\). Spatial attention allows the model to prioritize regions relevant to the current task, extract more informative features and suppress irrelevant or redundant responses, which enhances the model's ability to discriminate and express features (Zhu et al. Citation2019).
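The spatial-attention branch can likewise be sketched in a few lines of PyTorch. The 7 × 7 single-output-channel convolution follows the description above, while the module name and padding choice are assumptions.

```python
import torch
import torch.nn as nn


class SpatialGate(nn.Module):
    """Spatial-attention branch of the dual gate; a minimal sketch using
    channel-wise average and max maps, a 7x7 conv and a sigmoid."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Two-channel input (avg map + max map), single-channel output.
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m = torch.mean(x, dim=1, keepdim=True)      # (B, 1, H, W) average map
        n, _ = torch.max(x, dim=1, keepdim=True)    # (B, 1, H, W) max map
        o = torch.cat([m, n], dim=1)                # fused descriptor o
        p = torch.sigmoid(self.conv(o))             # spatial weights p in [0, 1]
        return x * p                                # weighted feature map
```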

4. Experiment

This section begins by introducing the training dataset and providing details of the experimental implementation. Next, the accuracy performance of the method proposed in this article is evaluated and compared with several state-of-the-art algorithms. Finally, experiments were conducted to verify the effectiveness of the proposed method through ablation analysis.

4.1. Dataset

This section evaluates the proposed method using two benchmark datasets: (1) The NWPU VHR-10 dataset, which consists of 10 categories of urban remote sensing objects with spatial resolutions ranging from 0.5 to 2 m (Wang et al. Citation2019). This dataset contains 650 images with objects and 150 background images across 10 categories; it was extracted from Google Earth and released by Northwestern Polytechnical University in 2014. (2) The RSOD dataset, released by Wuhan University in 2015, has spatial resolutions ranging from 0.3 to 3 m and an image size of 920 × 1050 pixels (Zou and Shi Citation2017). It includes four types of targets: airplanes, playgrounds, overpasses and oil tanks, with a total of 976 images and 6950 annotated target regions of interest. The selected datasets cover a wide range of categories and a large number of images; they exhibit variations in object size, differences in spatial resolution between categories and a combination of high inter-class similarity and intra-class diversity.

This article evaluates the proposed improved method using precision (P), recall (R), F1 score, average precision (AP), mean average precision (mAP), computational cost (FLOPs), parameter count (Params) and frames per second (FPS). All evaluation indicators are calculated following the PASCAL VOC standard (Vicente et al. Citation2014). Table 1 details the experimental environment used for training.

Table 1. Experimental environment.
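For reference, the sketch below illustrates how the precision, recall, F1 and PASCAL VOC-style average precision reported in this section are typically computed; it is a generic illustration, not the evaluation code used by the authors.

```python
import numpy as np


def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from per-class detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """All-points AP: area under the monotonically decreasing PR envelope,
    as used in the PASCAL VOC evaluation protocol."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make precision non-increasing from right to left.
    for k in range(len(p) - 2, -1, -1):
        p[k] = max(p[k], p[k + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

The mAP reported in the tables is then the mean of the per-class AP values.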

4.2. Qualitative analysis

Figure 5 displays the image test results of the improved model proposed in this article on the NWPU VHR-10 and RSOD public datasets. The following observations can be made: In (A) and (B), there are multiple interacting objects of the same category. In (C), the urban environment background is complex, with colour cast and uneven lighting. In (D), there are large remote sensing acquisition angles, low contrast and low texture features. (E) and (F) contain many objects that are small in size and complex in shape. The results demonstrate that the improved model achieves superior detection performance. The bounding box colours correspond to specific objects as follows: a blue bounding box represents a basketball court, a red bounding box represents an airplane, a green bounding box represents a baseball field, a purple bounding box represents a bridge, a pink bounding box represents a vehicle, a yellow bounding box represents a storage tank and an orange bounding box represents a vessel.

Figure 5. Image detection results of common datasets.


This article compares the improved model with YOLOv7 for urban fuzzy small target image detection on the NWPU VHR-10 dataset, as depicted in Figure 6. The detection results of YOLOv7 are represented by labels 1, 2, 3, 7, 8 and 9, while labels 4, 5, 6, 10, 11 and 12 correspond to the detection results of the improved model proposed in this article. The test images contain numerous small and blurry objects. A comparison between (1) and (4) reveals that the YOLOv7 model fails to detect the tennis court marked by the green frame, although it successfully detects the horizontally oriented basketball court target. Similarly, a comparison between (2) and (5) shows a baseball field detection error and that the bridge enclosed in the green box is not detected. A comparison between (3) and (6) indicates that the YOLOv7 model fails to detect 3 out of the 18 harbour targets, whereas the improved method proposed in this article misses only one target. Additionally, a comparison between (7) and (10) reveals that the YOLOv7 model is unable to detect the tiny vehicle targets. A comparison between (8) and (11) demonstrates that the improved method proposed in this article achieves higher target accuracy than the YOLOv7 model; both methods successfully detect 7 aircraft targets. A comparison between (9) and (12) shows that the YOLOv7 model fails to detect 5 tennis court targets and 1 basketball court target. These results demonstrate that the proposed optimization strategy effectively enhances the detection performance for blurred high-altitude remote sensing objects.

Figure 6. Detection results for fuzzy small target images.


During the detection of urban high-altitude remote sensing targets, complex background factors, including cloud cover, atmospheric humidity and sunlight intensity, cause inevitable interference. Accordingly, a cloud and fog test set was formed by selecting 300 pictures from the two datasets and adjusting their light intensities. The comparison models were evaluated on this test set, which is displayed in Figure 7.

Figure 7. Cloud and fog fuzzy test set display.
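A fog test image of this kind can be synthesised with a simple centre-weighted haze blend. The sketch below is one plausible way to generate such images for a given atomization concentration; the exact rendering procedure used in the article is not specified, so the atmospheric-scattering blend, the distance scaling and the airlight value are assumptions.

```python
import numpy as np


def add_center_fog(image: np.ndarray, concentration: float = 0.05,
                   airlight: float = 0.9) -> np.ndarray:
    """Add synthetic haze that thickens towards the image centre.

    `image` is a float array in [0, 1] with shape (H, W, 3);
    `concentration` plays the role of the atomization concentration."""
    h, w = image.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    # Distance from the picture centre, used as the fog "depth".
    d = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    d = 1.0 - d / d.max()                      # thickest at the centre
    t = np.exp(-concentration * d * 100.0)     # transmission map
    t = t[..., None]
    # Atmospheric-scattering style blend of the image with the airlight.
    return image * t + airlight * (1.0 - t)
```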


Figure 8 presents the detection results of both the YOLOv7 model and the improved model proposed in our research on the NWPU VHR-10 and RSOD datasets. These datasets contain objects that are prone to being missed or misdetected in complex backgrounds. In the figure, (a) represents the original image, while (b–d) show the image fogged around the picture centre with atomization concentrations of 0.01, 0.05 and 0.1 times, respectively. The first line shows the detection results of the YOLOv7 model, while the second line shows the detection results obtained using the improved model proposed in this research. In class (a), the original image is used for detection and the target ship is characterized by its small size and weak features; the YOLOv7 model exhibits false detection, identifying a seabed gully as an aircraft target. In class (b), the background features are highly similar, making it difficult for the YOLOv7 model to detect the weak bridge target features. In class (c), detection is performed under clouds at an atomization concentration of 0.05 times, against a highly complex background in which the seaport and small boat exhibit weak features. Class (d) exhibits cloud and fog background features that are highly similar; nevertheless, compared with the YOLOv7 model, the strategy proposed in this research better captures target details and detects the four previously missed tennis courts. These results validate the effectiveness of our improved model in detecting remote sensing objects under complex background interference.

Figure 8. Cloud and fog blur target detection results.


4.3. Quantitative analysis

To further analyse the positional detection accuracy achieved by the enhanced model, we conducted experimental assessments on two publicly available datasets: NWPU VHR-10 and RSOD. The analysis was performed with a 95% confidence interval. For comparative purposes, we employed one-stage networks, including RetinaNet, M2Det, YOLOv5, YOLOv7 and YOLOv8, as well as two-stage networks such as EfficientDet, MFE and Faster RCNN; these generic models were introduced as benchmarks. Tables 2 and 3 present the accuracy results on the NWPU VHR-10 dataset, whereas Tables 4 and 5 display the accuracy results on the RSOD dataset.

Table 2. Accuracy of the classes detected by the one-stage network on NWPUVHR-10.

Table 3. Accuracy of the classes detected by the two-stage network on NWPUVHR-10.

Table 4. Accuracy of the classes detected by the one-stage network on RSOD.

Table 5. Accuracy of the classes detected by the two-stage network on RSOD.

Synchronous experiments were conducted using the NWPU VHR-10 dataset to compare with other one-stage object detection methods. Through the comparisons presented in Table 2, the proposed enhancement strategy performs on par with other methods in recognizing large urban targets such as basketball courts and oil tanks, but outperforms them in identifying small targets such as harbours, ground track fields and bridges, achieving accuracies of 94%, 97% and 98%, respectively.

The NWPU VHR-10 dataset was also used for synchronized experiments comparing with other two-stage target detection methods. As depicted in Table 3, the proposed enhancement strategy exhibits superior target identification across various sizes, with nearly all instances of airplanes, baseball stadiums and bridges being detected.

Synchronized experiments were conducted using the RSOD dataset, comparing with other one-stage target detection methods. Through the comparisons presented in Table 4, the enhancement strategy proposed in this study achieved the best mAP on the RSOD dataset, reaching 96%. The improved model demonstrated significant superiority in detecting airplane and oil tank targets compared with other one-stage detectors, achieving optimal performance with accuracy improvements of 0.05% and 0.02%, respectively.

The RSOD dataset was also used for synchronous experiments comparing with other two-stage target detection methods. Through the comparison shown in Table 5, the enhancement strategies proposed in this study outperform the other three two-stage network models. Experimental results on the NWPU VHR-10 and RSOD datasets indicate that the proposed algorithm achieves higher classification accuracy and effectively utilizes multi-object feature information to enhance scene target recognition performance.

4.4. Ablation experiment

4.4.1. Dual-gated feature units

This article adds a double gating unit module after the multibranch stacking of the FPN network layer in YOLOv7. A series of ablation experiments were conducted on the NWPU VHR-10 and RSOD datasets. The Channel Attention operation (CA), Spatial Attention operation (SA) and Transpose operation (RC) within the dual-gated unit module are evaluated individually. The outcomes are displayed in Table 6.

Table 6. Double gated feature unit ablation experiment.

During the verification process on the NWPU VHR-10 dataset, the YOLOv7 model achieved an mAP of 90.56%, an F1 score of 79.54% and required 106.47 M FLOPs. The inclusion of the dual feature gating mechanism in the neck feature fusion network layer resulted in an increase of 3.89% in mAP and 6.95% in F1, at a cost of only 9.44 M additional FLOPs. Furthermore, when the gating unit performed only channel feature fusion, the mAP was 1.26% higher than that of the base model. Similarly, the gating unit performing only spatial feature fusion resulted in a 1.75% improvement in mAP compared to the base model. On the other hand, when the gating unit solely used transpose operation fusion features, the mAP decreased by 0.31% relative to the base model.

During the verification process on the RSOD dataset, the YOLOv7 model achieved an mAP of 93.01%, an F1 score of 87.68% and required 106.47 M FLOPs. The inclusion of the dual feature gating mechanism in the neck feature fusion network layer resulted in a 2.22% increase in mAP and a 6.63% increase in F1. Furthermore, when the gating unit exclusively performed channel feature fusion, the mAP was 0.1% higher than that of the base model. Similarly, the gating unit performing only spatial feature fusion yielded a 0.64% improvement in mAP compared to the base model. However, when the gating unit solely used transpose operation fusion features, the mAP decreased by 1.63% relative to the base model.

Table 6 shows that the base model accuracies on the NWPU VHR-10 and RSOD datasets are 90.56% and 93.01%, respectively. Following the introduction of the dual feature gating mechanism, these accuracies increased by 3.89% and 2.22%, respectively, at the cost of an associated 8.1% increase in FLOPs. This can be attributed to the incorporation of double-gated feature units after multibranch stacking, which adds network layers and increases the computational complexity of the overall model. To summarize, within the dual feature gating mechanism, the most noticeable improvements come from channel attention and spatial attention, while the transpose operation contributes to the overall gain. Consequently, the feature-enhanced gating module can significantly improve accuracy without compromising operational efficiency.

5. Conclusion

This article presents a novel remote sensing object detection algorithm that incorporates a dual-attention gating mechanism and an adaptive fusion strategy. The proposed approach enhances the YOLOv7 model to address two significant challenges encountered in high-altitude urban remote sensing object detection. First, after PANet feature fusion, conflicts arise between multiscale feature information, leading to the submergence of target information and feature loss. Second, because urban target objects differ only minimally from ground backgrounds, detailed target information tends to be lost during multiscale feature fusion. The adaptive fusion strategy and module effectively mitigate conflicts in multiscale feature fusion by suppressing irrelevant information while preserving target-related structure.

Experiments conducted on the RSOD and NWPU VHR-10 public datasets validate the algorithm proposed in this article, demonstrating significant advantages in both detection accuracy and efficiency when compared with current mainstream object detectors. Moreover, our algorithm exhibits faster training speed and utilizes fewer parameters, enabling efficient detection of small targets in remote sensing applications. Nevertheless, there remain opportunities for further improvement in the complexity and computational efficiency of our model. Moving forward, our focus will be on making lightweight optimizations to the model while ensuring detection accuracy, thus, facilitating its application in the autonomous object detection of remote sensing satellites.

Authors’ contributions

Luxuan Bian performed the data analysis; Bo Li performed the formal analysis; Jue Wang performed the validation; Zijun Gao wrote the manuscript.

Abbreviations
AlexNet = ImageNet classification with deep convolutional neural networks
EfficientDet = scalable and efficient object detection
F1 = a metric to evaluate the performance of a dichotomous model
Faster RCNN = towards real-time object detection with region proposal networks
FLOPs = an indicator of the number of floating-point operations in a computer program
FPN = feature pyramid networks for object detection
PASCAL VOC = object detection dataset
RCNN = rich feature hierarchies for accurate object detection and semantic segmentation
RetinaNet = focal loss for dense object detection
YOLOv7 = trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

Acknowledgements

We would like to express our gratitude for the encouragement and support from our co-authors. The ‘DIOR’ large-scale benchmark dataset proposed by Han Junwei’s team at Xi’an University of Technology was an extremely important resource that provided valuable information. We also thank MathWorks for providing the excellent MATLAB software package, which is widely available and comes with documentation that helped us complete this work under the Windows operating system.

Disclosure statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

Data availability statement

The NWPU VHR-10 dataset is freely available at https://github.com/RSIA-LIESMARS-WHU/RSOD-Dataset. The RSOD dataset is freely available at http://www.escience.cn/people/gongcheng/DIOR.html. The relabelled images and codes that support the findings of this study are available from the corresponding author upon reasonable request.

References

  • Ahern J. 2011. From fail-safe to safe-to-fail: sustainability and resilience in the new urban world. Landscape Urban Plann. 100(4):341–343. doi:10.1016/j.landurbplan.2011.02.021.
  • Bochkovskiy A, Wang CY, Liao HYM. 2020. Yolov4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
  • Ceccato VA, Snickars F. 2000. Adapting GIS technology to the needs of local planning. Environ Plann B Plann Des. 27(6):923–937. doi:10.1068/b26103.
  • Chouhan SS, Kaul A, Singh UP. 2019. Image segmentation using computational intelligence techniques. Arch Computat Methods Eng. 26(3):533–596. doi:10.1007/s11831-018-9257-4.
  • Del Prete R, Graziano MD, Renga A. 2021. RetinaNet: a deep learning architecture to achieve a robust wake detector in SAR images. 2021 IEEE 6th International Forum on Research and Technology for Society and Industry (RTSI), Naples, Italy. IEEE. p. 171–176. doi:10.1109/RTSI50628.2021.9597297.
  • Girshick R. 2015. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision. p. 1440–1448.
  • Gomez-Chova L, Tuia D, Moser G, Camps-Valls G. 2015. Multimodal classification of remote sensing images: a review and future directions. Proceedings of the IEEE. 103(9):1560–1584.
  • Hazaymeh K, Almagbile A, Alsayed A. 2022. A cascaded data fusion approach for extracting the rooftops of buildings in heterogeneous urban fabric using high spatial resolution satellite imagery and elevation data. Egypt J Remote Sens Space Sci. 26(1):245–252.
  • He K, Zhang X, Ren S, Sun J. 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 37(9):1904–1916. doi:10.1109/TPAMI.2015.2389824.
  • He K, Zhang X, Ren S, Sun J. 2016. Identity mappings in deep residual networks. Computer Vision–ECCV 2016: 14th European Conference, Proceedings, Part IV 14, Oct 11–14; Amsterdam, The Netherlands: Springer International Publishing. p. 630–645.
  • Huang G, Liu S, Van der Maaten L. 2018. Condensenet: an efficient densenet using learned group convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. p. 2752–2761.
  • Ioffe S, Szegedy C. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning. p. 448–456.
  • Krizhevsky A, Sutskever I, Hinton GE. 2017. Imagenet classification with deep convolutional neural networks. Commun ACM. 60(6):84–90. doi:10.1145/3065386.
  • Li J, Huang X, Tu L, Zhang T, Wang L. 2022. A review of building detection from very high resolution optical remote sensing images. GISci Remote Sens. 59(1):1199–1225. doi:10.1080/15481603.2022.2101727.
  • Li K, Wan G, Cheng G, Meng L, Han J. 2020. Object detection in optical remote sensing images: a survey and a new benchmark. ISPRS J Photogramm Remote Sens. 159:296–307. doi:10.1016/j.isprsjprs.2019.11.023.
  • Liu P, Zhang Z, Meng Z, Gao N. 2020. Joint attention mechanisms for monocular depth estimation with multi-scale convolutions and adaptive weight adjustment. IEEE Access. 8:184437–184450.
  • Maguire DJ. 1991. An overview and definition of GIS. Geogr Inf Syst. 1(1):9–20.
  • Reis D, Kupec J, Hong J, Daoudi A. 2023. Real-time flying object detection with YOLOv8. arXiv preprint arXiv:2305.09972.
  • Rottensteiner F, Sohn G, Gerke M, Wegner JD, Breitkopf U, Jung J. 2014. Results of the ISPRS benchmark on urban object detection and 3D building reconstruction. ISPRS J Photogramm Remote Sens. 93:256–271. doi:10.1016/j.isprsjprs.2013.10.004.
  • Soudy M, Afify Y, Badr N. 2022. RepConv: a novel architecture for image scene classification on Intel scenes dataset. Int J Intell Computing Inf Sci. 22(2):63–73.
  • Sun X, Wu P, Hoi SC. 2018. Face detection using deep learning: an improved faster RCNN approach. Neurocomputing. 299:42–50. doi:10.1016/j.neucom.2018.03.030.
  • Tan M, Pang R, Le QV. 2020. Efficientdet: scalable and efficient object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seoul, Korea (South). p. 10781–10790.
  • Vicente S, Carreira J, Agapito L, Batista J. 2014. Reconstructing PASCAL VOC. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. p. 41–48.
  • Wang C, Bai X, Wang S, Zhou J, Ren P. 2019. Multiscale visual attention networks for object detection in VHR remote sensing images. IEEE Geosci Remote Sensing Lett. 16(2):310–314. doi:10.1109/LGRS.2018.2872355.
  • Wang CY, Bochkovskiy A, Liao HYM. 2022. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada. p. 7464–7475.
  • Zhang L, Li H, Wang P, Yu X. 2013. Detection of regions of interest in a high-spatial-resolution remote sensing image based on an adaptive spatial subsampling visual attention model. GISci Remote Sens. 50(1):112–132. doi:10.1080/15481603.2013.778553.
  • Zhang Z, Liu F, Zhao X, Wang X, Shi L, Xu J, Yu S, Wen Q, Zuo L, Yi L, et al. 2018. Urban expansion in China based on remote sensing technology: a review. Chin Geogr Sci. 28(5):727–743. doi:10.1007/s11769-018-0988-9.
  • Zhao Q, Sheng T, Wang Y, Tang Z, Chen Y, Cai L, Ling H. 2019. M2det: a single-shot object detector based on multi-level feature pyramid network. AAAI. 33(01):9259–9266. Honolulu, Hawaii, USA. (doi:10.1609/aaai.v33i01.33019259).
  • Zhu X, Cheng D, Zhang Z, Lin S, Dai J. 2019. An empirical study of spatial attention mechanisms in deep networks. Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada. p. 6688–6697.
  • Zou Z, Shi Z. 2017. Random access memories: a new paradigm for target detection in high resolution aerial remote sensing images. IEEE Trans Image Process. 27(3):1100–1111. doi:10.1109/TIP.2017.2773199.