Research Article

SDMSEAF-YOLOv8: a framework to significantly improve the detection performance of unmanned aerial vehicle images

Article: 2339294 | Received 27 Nov 2023, Accepted 01 Apr 2024, Published online: 10 Apr 2024

Abstract

The detailed, high-resolution images captured by drones pose challenges to target detection algorithms because of complex scenes and small-sized targets. Moreover, targets in unmanned aerial vehicle images are usually affected by factors such as viewing perspective, occlusion, and light, which increase the difficulty of target detection. In response to these issues, we propose SDMSEAF-YOLOv8, an improved target detector based on YOLOv8 combined with a Bi-directional Feature Pyramid Network, to improve the model's perception of multiscale targets. A Space-to-depth layer replaces the traditional strided convolution layer to enhance the extraction of fine-grained information and small-sized target features. A Multi-Separated and Enhancement Attention module enhances feature learning in occluded target regions, thus reducing missed and false detections. Four detection heads, each responsible for a different size range, are employed to improve the accuracy and robustness of small target detection. The conventional non-maximum suppression algorithm is improved to reduce missed detections in densely occluded scenes: an attenuation function adjusts the confidence of each processed box according to its overlap with the highest-scoring box. Experiments demonstrate that the accuracy of SDMSEAF-YOLOv8 exceeds that of state-of-the-art models on the VisDrone2019-DET-val dataset, with a mAP of 42.9% at 640-pixel resolution, 14.8% higher than the baseline YOLOv8-x model and 6.0% higher than the known state-of-the-art Fine-Grained Target Focusing Network model, at roughly twice its detection speed.

1. Introduction

The field of drone applications is broad (Yu et al. Citation2022; Kou et al. Citation2021; Li et al. Citation2022) and continues to expand as innovation drives high-quality development and digital transformation. Combining unmanned aerial vehicle (UAV) images with deep learning target detection has become a main direction of research (Zhang et al. Citation2023; Pham et al. Citation2020; Xi et al. Citation2022). However, UAV images usually contain a large amount of background and few target objects, and problems such as small target size, partial occlusion, and overlap make it difficult to detect the objects of interest.

With improvements in UAV image resolution and target detection algorithms, detection accuracy has increased. Breakthrough innovations such as Region-based Convolutional Neural Networks (R-CNN) (Girshick et al. Citation2014; Girshick Citation2015) introduced a two-stage detection pipeline for candidate box generation and region feature extraction. The You Only Look Once (YOLO) family (Redmon et al. Citation2016; Redmon and Farhadi Citation2018; Bochkovskiy et al. Citation2020; Wang et al. Citation2022) transforms the target detection task into a single forward-propagation regression problem. Compared to conventional two-stage methods, the YOLO family can achieve real-time target detection, greatly improving processing speed while maintaining identification precision.

YOLO-series algorithms have been embraced by researchers and developers for their real-time performance, simple and efficient architecture, multi-task capability, open-source community support, and continual incorporation of new techniques. As a result, many new versions (Redmon and Farhadi Citation2017; Zhu et al. Citation2021; Gao et al. Citation2021) have been developed and are widely used in many fields. Algorithms such as EfficientNet (Tan and Le Citation2019), the Single Shot MultiBox Detector (SSD) (Liu et al. Citation2016), and RetinaNet (Lin et al. Citation2017) are also widely applied in fields such as object recognition and medical image analysis.

Researchers have improved feature extraction networks (Hu et al. Citation2018) and loss functions (Rezatofighi et al. Citation2019), decoupled detection heads, and sliced high-resolution UAV images (Zhu et al. Citation2021) to mitigate occlusion, overlap, and small-target detection problems, but focusing on a single problem leads to limited detection performance.

We analyzed the images in the VisDrone2019 dataset, shown in Figure 1, and found that the YOLOv8-x baseline selectively neglects small target objects and fails to localize them, resulting in a severe loss of precision, especially when there is occlusion. The main occlusion situations are as follows:

Figure 1. Sources of occlusion. Blue boxes represent mutual occlusion between detection objects, red boxes indicate the occlusion of detection objects by trees or vegetation, and green boxes signify occlusion due to objects being located within buildings or in close proximity to identifiers.

  1. Mutual occlusion between objects: Blocking can occur between pedestrians and bicycles or vehicles;

  2. An object located in a building or obscured near identifiers: Part or all of a pedestrian or vehicle is obscured by a building;

  3. An object obscured by a tree or vegetation: Part or all of a pedestrian or vehicle is obscured by a tree, bush, or other vegetation.

When the drone observes the scene from a long distance, the detected targets become small and the risk of what we call distant perspective occlusion increases greatly. To address this problem, we propose an improved network model, SDMSEAF-YOLOv8.

The main contributions of this paper are summarized as follows:

  1. In combination with the Bi-directional Feature Pyramid Network (BiFPN) structure, feature maps are iteratively aggregated through repeated upsampling and downsampling, and information is transferred and fused bidirectionally between feature maps at different scales, which effectively integrates feature information and improves the accuracy and robustness of target detection;

  2. The Space-to-depth (SPD) module replaces traditional strided convolution and pooling operations to downsample feature maps, reducing the loss of fine-grained information. While retaining the detailed information, better feature extraction is achieved;

  3. The Multi-Separated and Enhancement Attention Module (MultiSEAM) captures feature relations of different scales, directions, or semantics by using multiple separated attention heads focusing on different feature subspaces, to enhance feature extraction and improve the accuracy and robustness of target detection;

  4. Adding a fourth detection head and extending the multitasking capability of the YOLOv8 framework enable simultaneous detection of multiple targets, improve the efficiency and accuracy of detection, and better focus on targets at different scales;

  5. By introducing an attenuation function to adjust the confidence level according to the degree of overlap, the improved non-maximum suppression (NMS) algorithm effectively solves the miss problem in dense occlusion and improves target detection performance.

2. Related work

The challenges of UAV image detection include the detection of numerous small targets, dealing with occlusion, and recognizing targets in complex backgrounds. These factors often result in low-quality feature extraction, which affects detection performance. These issues have been addressed through advanced modules and algorithms. We review work in one-stage and two-stage target detection and the implementation of YOLO-series algorithms in remote sensing.

2.1. Two-stage target detection algorithms

In the field of computer vision, progress in target detection algorithms reflects a development from generating preliminary candidate regions to accurate classification and regression analysis. Traditional two-stage detection prioritizes accuracy but suffers from slow detection, strict computing resource requirements, and difficulties with large-scale datasets. Since Girshick et al. introduced R-CNN in 2014 (Girshick et al. Citation2014), subsequent work such as Fast R-CNN (Girshick Citation2015) and Cascade R-CNN (Cai and Vasconcelos Citation2019) has built upon this framework, optimizing the detection process and enhancing speed and accuracy. These algorithms emphasize feature extraction and generate candidate boxes through mechanisms like the Region Proposal Network (RPN) (Ren et al. Citation2015), demonstrating the central role of performance optimization in target detection tasks. By simplifying the model structure and implementing cascade strategies, they have notably improved computational accuracy and the handling of large-scale datasets. Although two-stage detection algorithms can achieve high accuracy, they remain noticeably slower than one-stage algorithms.

2.2. Single-stage target detection

Single-stage target detection algorithms have undergone rapid development recently, supporting end-to-end training by integrating candidate box generation and target classification, significantly improving processing speed and accuracy. The YOLO-series and other single-stage target detection algorithms focus on real-time performance and end-to-end training. Each version, from the simple regression model of YOLOv1 to the anchor-free architecture of YOLOv8, has introduced innovations in network structure, detection mechanism, and performance optimization. From YOLOv1 (Redmon et al. Citation2016; Redmon and Farhadi Citation2017) to YOLOv3 (Redmon and Farhadi Citation2018), anchor boxes and multiscale detection have been introduced to enhance the detection of targets with different shapes and scales. YOLOv4 (Bochkovskiy et al. Citation2020) and YOLOv5 (G. Jocher Citation2020) further optimized feature extraction and data enhancement to increase accuracy and robustness. YOLOv6 (Li et al. Citation2022) and YOLOv7 (Wang et al. Citation2022) introduced new loss functions and label allocation methods to further improve detection accuracy. The latest version, YOLOv8 (G. Jocher et al. Citation2023), better balances speed and accuracy through its anchor-free architecture and decoupling head design. Algorithms such as SSD (Liu et al. Citation2016) and EfficientNet (Tan and Le Citation2019) have also made progress in multi-level prediction and performance optimization by introducing default boxes and BiFPN. Other algorithms, including PP-YOLO (Long et al. Citation2020), Scaled-YOLOv4 (Wang et al. Citation2021), YOLOX (Ge et al. Citation2021), DAMO-YOLO (Xu et al. Citation2022), and YOLO-NAS (R. team Citation2023), have shown significant progress in real-time processing, computing efficiency, and accuracy improvement. PP-YOLO, which utilizes the PaddlePaddle platform (Ma et al. Citation2019) and ResNet50 architecture, achieved an Average Precision (AP) of 45.9% and a speed of 73 FPS on the MS COCO dataset using an NVIDIA V100.

2.3. Applications of YOLO algorithm family in remote sensing

To overcome the challenges associated with remote sensing image detection, the YOLO algorithm family has seen a series of improvements, including YOLO-SE (Wu and Dong Citation2023), MSA-YOLO (Su et al. Citation2023), FE-YOLO (Xu and Wu Citation2021), RSI-YOLO (Li et al. Citation2023), U-YOLO (Guo et al. Citation2023), and YOLO-Class (Liu et al. Citation2023). By integrating multi-resolution (MR) technology and advanced strategies such as the Squeeze-and-Excitation module, the Efficient Multi-Scale Attention module, multi-receptive-field techniques, feature enhancement, adaptive fusion modules, sub-pixel convolution, and the varifocal loss function, these variants have significantly improved the detection accuracy of small- and medium-sized targets in remote sensing images. These strategies optimize the fusion of high-level semantic information and low-level spatial information, enhancing the accuracy of target classification and localization, and can effectively meet the challenges presented by high-density, large-coverage images captured by UAVs. This is achieved through adaptive recalibration of the feature responses between channels, more robust bounding box regression, and an improved target detection framework.

In this paper, we enhanced modules and post-processing algorithms proposed in more than 20 papers (Zhu et al. Citation2023; Yu et al. Citation2023; Wang et al. Citation2023, etc.) across various domains and applied them to UAV image detection. Using YOLOv8 as the foundational model, we conducted experiments and found that the BiFPN structure from EfficientNet, the SPD module from Space to Depth Convolution (SPD-Conv), the MultiSEAM module from YOLO-FaceV2, and Soft-NMS post-processing all contribute significantly to improving the accuracy and efficiency of remote sensing image target detection. During the refinement of these modules and post-processing algorithms, we emphasized improving the ability to address distant perspective occlusion in UAV images without compromising processing speed, so as to enhance detection accuracy, robustness, and real-time performance.

3. Methods

3.1. Overview of SDMSEAF-YOLOv8

To solve the problem that small target objects cannot be fully detected because of distant perspective occlusion, we propose the SDMSEAF-YOLOv8 algorithm, whose overview is shown in Figure 2; it effectively improves the accuracy and robustness of target detection in UAV images.

Figure 2. The improved SDMSEAF-YOLOv8 algorithm incorporates the BiFPN structure, the SPD module, and the Multi-Separated and Enhancement Attention Module, and improves the NMS algorithm. Floor: level of each module; repeats: number of times a module is repeated. Conv: Convolution. SPD: Space to depth. MultiSEAM: Multi-Separation and Enhancement Attention Module. C2f: Cross stage partial bottleneck with two convolutions. Upsample: this layer enlarges the input data to match the dimensions of the target layer. Concat: this connects two or more arrays. Detect_Head: this identifies and locates the target and generates the target detection box.


Compared with the YOLOv8 baseline, our model focuses more on small objects and occluded objects at a distance. In Figure 3, red regions represent the degree of model focus. Figure 3(a) shows the heat map produced by the YOLOv8 baseline, whose attention concentrates on objects at close range and ignores objects at a distance. Figure 3(b) shows our SDMSEAF-YOLOv8 model, which attends to objects both at close range and at a distance and focuses more accurately on the latter; this is conducive to the learning of later features and improves detection accuracy.

Figure 3. Heat map visualization of feature maps. (a) YOLOv8-x baseline; (b) SDMSEAF-YOLOv8. The YOLOv8 baseline model has difficulty focusing on small targets and obscured objects. Our SDMSEAF-YOLOv8 model can better detect small objects and occluded objects at a distance.


3.2. Bi-directional feature pyramid network

To enhance the precision and reliability of target detection in UAV images, this study employs the BiFPN structure from EfficientNet, which effectively merges multiscale feature maps by means of bidirectional parallel paths, facilitating the exchange of information between different scale features and aiding in the identification of smaller targets. The inclusion of four convolution layers in the network head standardizes the size of varying scales of feature maps, thereby resolving the issue of size discrepancy during feature map fusion. This increases detection accuracy and enhances model stability. Our model surpasses the YOLOv8 baseline, which combines two scales of feature maps, by amalgamating three scales of feature maps to capture target information. This enables more efficient multiscale target detection and notably improves the target detection performance of UAV images.

In this paper, drawing on the BiFPN structure, four consecutive convolution layers are added in the head section to convert the feature maps to a uniform size, so that feature maps at different scales can be concatenated more effectively, improving the accuracy and stability of target detection in UAV images. In contrast, the YOLOv8 baseline model combines only two scales.
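To make the fusion step concrete, the following is a minimal PyTorch sketch of the weighted bidirectional fusion idea behind BiFPN: each input scale is projected by a 1 × 1 convolution, resized to a common resolution, and combined with learnable, normalized weights. The channel width, the use of three inputs, the nearest-neighbor resizing, and the module name WeightedFusion are illustrative assumptions rather than the exact SDMSEAF-YOLOv8 configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedFusion(nn.Module):
    # Normalized weighted fusion of three feature maps, in the spirit of BiFPN.
    # Channel count and number of inputs are illustrative assumptions.
    def __init__(self, channels: int = 256, eps: float = 1e-4):
        super().__init__()
        # One 1x1 convolution per input to project it to a common width.
        self.proj = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in range(3)])
        # Learnable, non-negative fusion weights (one per input scale).
        self.w = nn.Parameter(torch.ones(3))
        self.eps = eps
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats):
        # feats: three tensors with the same channel count but possibly
        # different spatial sizes; resize all of them to the first one.
        target_hw = feats[0].shape[-2:]
        resized = [F.interpolate(conv(f), size=target_hw, mode="nearest")
                   for conv, f in zip(self.proj, feats)]
        w = F.relu(self.w)
        w = w / (w.sum() + self.eps)          # normalized fusion weights
        fused = sum(wi * fi for wi, fi in zip(w, resized))
        return self.out_conv(fused)


# Usage: fuse P3/P4/P5-like maps at a 640-pixel input (illustrative shapes).
p3, p4, p5 = (torch.randn(1, 256, s, s) for s in (80, 40, 20))
print(WeightedFusion(256)([p3, p4, p5]).shape)  # torch.Size([1, 256, 80, 80])

In the full network, such fusion nodes are stacked along top-down and bottom-up paths so that information flows in both directions across scales.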

3.3. Space to depth

SPD-Conv was developed to address the loss of detailed information and imprecise feature representation caused by strided convolution and pooling layers in CNN architectures. The core formula (Sunkara and Luo Citation2022) is presented as Equation (1). SPD-Conv is composed of an SPD layer and a non-strided convolution (Conv) layer, which replace strided convolution and pooling layers to achieve feature downsampling.

SPD layer: The input feature map is SPD transformed and segmented into several sub-feature maps, each corresponding to a small area of the original feature map, whose details are therefore well-preserved.

Non-strided convolution layer: Following SPD, non-strided (stride = 1) convolution is performed, which preserves the spatial resolution of feature maps while allowing models to learn more accurate feature representations. Non-strided convolution can capture fine-grained information more effectively, contributing to improved model performance and accuracy.

$$f_{x,y} = X[\,x:S:scale,\ y:S:scale\,], \qquad x, y \in \{0, 1, \ldots, scale-1\} \tag{1}$$

where f_{x,y} denotes the sub-feature map obtained by sampling the input tensor X starting from row x and column y, the scale factor determines how many elements are skipped between sampled rows or columns, and the slice operation x : S : scale runs from starting index x to end index S (the spatial size of the feature map) with step scale.

In this work, a three-dimensional feature map X ∈ R^(S×S×C) is input to the SPD layer, where S×S is the spatial resolution and C is the number of channels. With scale = 2, X is divided into four sub-feature maps f_{x,y}, one for each offset in the 2 × 2 sampling grid; each sub-map retains the original channel dimension C, and together they form the intermediate feature map X_1. Equation (1) describes this segmentation in detail. The sub-maps f_{x,y} are then rearranged along the depth dimension, which halves the spatial resolution in each direction while increasing the number of channels from C to 4C. The output is the new feature map X_2 with dimensions R^((S/2)×(S/2)×4C). The process is presented as Algorithm 1.

Algorithm 1:

Enhancing Feature Representation with Space to depth

Input: feature map X ∈ R^(S×S×C); scale factor scale

Output: feature map X_2 ∈ R^((S/scale)×(S/scale)×(C·scale²))

1. Split the feature map: f_{x,y} = X[x::scale, y::scale] → X_1 ∈ R^((S/2)×(S/2)×4C) (for scale = 2)

2. Reorder space to depth: X_2 = reorder(f_{x,y}) → R^((S/scale)×(S/scale)×(C·scale²))
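As a concrete illustration of Algorithm 1, the following is a minimal PyTorch sketch of the space-to-depth rearrangement followed by a non-strided convolution, assuming scale = 2; the function and class names, the kernel size, and the output channel count are illustrative choices rather than the exact SPD-Conv settings used in SDMSEAF-YOLOv8.

import torch
import torch.nn as nn


def space_to_depth(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    # Rearrange an (N, C, S, S) map into (N, C*scale^2, S/scale, S/scale).
    # Each sub-map f_{i,j} of Equation (1) takes every scale-th pixel starting
    # at offset (i, j); the sub-maps are stacked along the channel dimension.
    subs = [x[..., i::scale, j::scale]
            for i in range(scale) for j in range(scale)]
    return torch.cat(subs, dim=1)


class SPDConv(nn.Module):
    # SPD layer followed by a non-strided (stride = 1) convolution.
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(in_ch * scale * scale, out_ch,
                              kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return self.conv(space_to_depth(x, self.scale))


# Usage: a 640 x 640 map with 64 channels becomes 320 x 320 with 256 channels
# after the SPD layer and keeps that resolution through the convolution.
x = torch.randn(1, 64, 640, 640)
print(space_to_depth(x).shape)    # torch.Size([1, 256, 320, 320])
print(SPDConv(64, 128)(x).shape)  # torch.Size([1, 128, 320, 320])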

By incorporating the SPD module, the performance of SDMSEAF-YOLOv8 on low-resolution images and small objects is improved, enhancing the fine-grained feature learning of small target objects, as shown in Figure 4.

Figure 4. Overview of SPD layer procedure: Spatial dimensions S of input are halved, while channel dimension C1 is quadrupled to produce feature map X1. Four sub-feature maps within X1, each representing a distinct quadrant, are concatenated along channel dimension to form X2, resulting in final feature map with C2 channels.


3.4. Multi-separated and enhancement attention module

MultiSEAM is an improved CNN module with an adaptive attention mechanism. Compared to the Separation and Enhancement Attention Module (SEAM) (Yu et al. Citation2022), MultiSEAM effectively enhances generalization performance by introducing SEAM at different scales. It extracts and assigns weights to features at small, medium, and large scales and combines these features to enhance generalization. Compared to the fixed-weight feature fusion in SEAM, MultiSEAM leverages adaptive average pooling and a fully connected layer: adaptive average pooling automatically adjusts the weights, and the fully connected layer further strengthens the feature fusion capability, allowing more flexible integration of features from various scales.

In this study, MultiSEAM aggregates the features {C2, C3, C4, C5} from the convolution layers, each with dimension R^(w_i×h_i×c_i). Convolution kernels with different dilation rates (d ∈ {1, 2, 3}) are used to capture multiscale spatial information, producing features F_d. Each multiscale feature P_d is obtained by reducing the spatial dimension of F_d through pooling. The pooled features P_d are concatenated into a multiscale feature map F_concat, which is then merged with the original feature R to create the aggregated feature M.

Algorithm 2:

Feature Aggregation using MultiSEAM

Input: Features {C2, C3, C4, C5} from convolution layers, where C_i ∈ R^(w_i×h_i×c_i)

Output: Aggregated feature M ∈ R^(w_i×h_i×c_i)

1. Convolve with variable dilation: Conv_{dilate=d}(R) → F_d, d ∈ {1, 2, 3}

2. Pool for spatial reduction: Pool(F_d) → P_d

3. Concatenate multiscale features: Concat([P_1, P_2, P_3]) → F_concat

4. Merge concatenated features: Merge([F_concat, R]) → M
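The following is a simplified PyTorch sketch of the multi-branch idea in Algorithm 2: three convolution branches with dilation rates 1, 2, and 3, adaptive average pooling, concatenation, and a small fully connected gate that re-weights the input feature map. The depthwise convolutions, the channel sizes, the residual re-weighting at the end, and the class name MultiScaleAttention are assumptions made for the sketch and simplify the actual MultiSEAM implementation.

import torch
import torch.nn as nn


class MultiScaleAttention(nn.Module):
    # Illustrative multi-branch attention in the spirit of MultiSEAM.
    def __init__(self, channels: int):
        super().__init__()
        # Step 1: three branches with dilation rates d = 1, 2, 3.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                          groups=channels),          # depthwise convolution
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            )
            for d in (1, 2, 3)
        ])
        # Step 2: adaptive average pooling to a 1 x 1 spatial descriptor.
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Step 4: fully connected gate with two linear layers.
        self.fc = nn.Sequential(
            nn.Linear(3 * channels, channels),
            nn.SiLU(),
            nn.Linear(channels, channels),
            nn.Sigmoid(),
        )

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = r.shape
        # Steps 1-2: dilated branches F_d, then pooled descriptors P_d.
        pooled = [self.pool(branch(r)).flatten(1) for branch in self.branches]
        # Step 3: concatenate the multiscale descriptors into F_concat.
        f_concat = torch.cat(pooled, dim=1)            # shape (N, 3C)
        # Step 4: merge with the original feature R by channel re-weighting.
        weights = self.fc(f_concat).view(n, c, 1, 1)
        return r + r * weights                         # aggregated feature M


# Usage with an illustrative head feature map.
feat = torch.randn(2, 128, 40, 40)
print(MultiScaleAttention(128)(feat).shape)  # torch.Size([2, 128, 40, 40])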

MultiSEAM is added before the feature maps are passed to the detection heads, as shown in Figure 5. The adaptive attention mechanism weights features according to the importance of each scale, enhancing feature representation and detection. This improves the ability to accurately locate and detect small target objects in UAV images.

Figure 5. MultiSEAM structure. Three submodules are created using the DCovN function, each with different parameter values. An adaptive average pooling layer and a fully connected layer containing two linear layers and two activation functions are also defined. The DCovN function applies Conv2d convolution followed by BatchNorm2d normalization and the SiLU activation function, repeated the specified number of (depth) times.


3.5. Soft-NMS algorithm

Traditional NMS removes redundant overlapping detection boxes and retains only the optimal detection results. The algorithm sorts all detection boxes by confidence score, starts from the highest-scoring box, and suppresses boxes that overlap with it beyond a threshold by setting their scores to zero. As a result, in scenes with excessive overlap, the algorithm can easily discard correct detections of neighboring objects. To address this problem, Soft-NMS (Bodla et al. Citation2017) employs a continuous decay function to update the scores of detection boxes, allowing overlapping detection boxes to be retained more accurately and avoiding missed detections.

Soft-NMS reduces the omission of correct detections by dynamically adjusting the confidence scores of candidate bounding boxes, which enhances the accuracy and robustness of target detection. The detected bounding boxes B, each with a confidence score s_i, are input. The bounding box with the highest score, b_max, is selected with the argmax operation and added to the output list. The confidence score of each remaining box is then dynamically decreased based on its Intersection over Union (IoU) with b_max and the IoU threshold θ, using a linear or Gaussian decay function. The bounding box list is updated and reordered according to the adjusted confidence scores s_i. This iterative process continues until all bounding boxes are evaluated, and the filtered bounding box set D is output. Unlike traditional NMS, which directly removes overlapping boxes, Soft-NMS preserves possibly correct detections by finely adjusting confidence scores, which improves the detection results. The procedure is presented as Algorithm 3.

In UAV image detection, traditional NMS may overly suppress overlapping detection boxes and cause missed detections, whereas Soft-NMS significantly improves average precision by retaining detections of multiple overlapping objects.

Algorithm 3:

Soft-NMS for Object Detection

Input: Detected bounding boxes B with scores s_i; IoU threshold θ

Output: Filtered bounding boxes D

1. Select the highest-scoring bounding box: b_max = argmax_B(s_i); D ∪ {b_max} → D

2. Decay scores of the remaining bounding boxes: s_i = s_i · decay(IoU(b_max, b_i)), b_i ∈ B

3. Update and sort the list of bounding boxes: B → sort(B, key = s_i)
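For concreteness, the following is a self-contained NumPy sketch of Algorithm 3 using a Gaussian decay function; the function name soft_nms, the sigma value, and the score threshold used to discard heavily decayed boxes are illustrative assumptions, not the exact values used in our experiments.

import numpy as np


def soft_nms(boxes: np.ndarray, scores: np.ndarray,
             sigma: float = 0.5, score_thresh: float = 0.001) -> list:
    # Soft-NMS (Bodla et al. 2017) with a Gaussian decay function.
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    # Returns the indices of the kept boxes.
    scores = scores.astype(float).copy()
    idxs = np.arange(len(scores))
    keep = []
    while len(idxs) > 0:
        # Step 1: select the highest-scoring remaining box b_max.
        top = np.argmax(scores[idxs])
        b_max = idxs[top]
        keep.append(int(b_max))
        idxs = np.delete(idxs, top)
        if len(idxs) == 0:
            break
        # IoU between b_max and every remaining box.
        x1 = np.maximum(boxes[b_max, 0], boxes[idxs, 0])
        y1 = np.maximum(boxes[b_max, 1], boxes[idxs, 1])
        x2 = np.minimum(boxes[b_max, 2], boxes[idxs, 2])
        y2 = np.minimum(boxes[b_max, 3], boxes[idxs, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_max = (boxes[b_max, 2] - boxes[b_max, 0]) * (boxes[b_max, 3] - boxes[b_max, 1])
        areas = (boxes[idxs, 2] - boxes[idxs, 0]) * (boxes[idxs, 3] - boxes[idxs, 1])
        iou = inter / (area_max + areas - inter)
        # Step 2: decay scores instead of zeroing them (Gaussian penalty).
        scores[idxs] *= np.exp(-(iou ** 2) / sigma)
        # Step 3: drop boxes whose decayed score falls below the threshold.
        idxs = idxs[scores[idxs] > score_thresh]
    return keep


# Usage: two heavily overlapping boxes are both retained, with the second
# one's confidence reduced rather than set to zero.
b = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
s = np.array([0.9, 0.8, 0.7])
print(soft_nms(b, s))  # [0, 2, 1] -- all three boxes survive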

4. Experiments

4.1. Datasets and settings

The VisDrone2019 dataset (Zhu et al. Citation2021) was used to evaluate our model. This large-scale UAV image dataset is designed for target detection and tracking. It contains video sequences and images taken from drones, covering a variety of targets in multiple urban scenarios and different weather conditions, with target categories including pedestrians, people, bicycles, cars, vans, trucks, tricycles, awning tricycles, buses, and motors. The target bounding boxes and category labels in the images are precisely annotated. The VisDrone2019 dataset has 8629 images, consisting of 6471 images for training, 548 for validation, and 1610 for testing. The resolution of most images is around 1920 × 1080 pixels, which provides clear target details and scene information.

4.2. Implementation and evaluation metrics

  1. Implementation: We used an NVIDIA RTX 4090 graphics card for model training and testing. Training was performed on the VisDrone2019 dataset for 150 epochs, with early stopping if there was no improvement within 20 epochs; the batch size was 8, and the input image size was 640 pixels. We used the SGD optimizer and validated on VisDrone2019-DET-val, with an IoU (Intersection over Union) threshold of 0.7, an initial learning rate of 0.01, a final learning rate of 0.0001, momentum of 0.937, and weight decay of 0.0005; other parameters were consistent with the YOLOv8 defaults. Core training parameters are listed in Table 1.

  2. Evaluation Metrics: Precision is the proportion of true positive samples among all samples classified as positive,

$$P = \frac{TP}{TP + FP} \tag{2}$$

where TP is the number of samples correctly predicted as positive, and FP is the number of false positives.

Table 1. Core training parameters and their values.

Recall is the proportion of actual positive samples that are correctly predicted,

$$R = \frac{TP}{TP + FN} \tag{3}$$

where TP is the number of correctly predicted positive cases, and FN is the number of false negatives.

The mAP is calculated as the average of the areas under the Precision-Recall curves generated by varying the IoU threshold. In the target detection task, the model produces a series of predictions at different confidence thresholds, from which Precision and Recall are calculated and the Precision-Recall curve is plotted. The mAP is the mean of the area under the curve, which comprehensively evaluates the accuracy of the model at different recall rates:

$$AP = \int_{0}^{1} P(R)\,dR \tag{4}$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{5}$$

where P is Precision, R is Recall, and N is the number of classes. The AP is obtained by averaging over 10 IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05; AP50, for example, is derived at IoU = 0.5. The summation in Equation (5) adds the AP values of all classes.
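For clarity, the following is a small NumPy sketch of Equations (4) and (5): AP as the area under a Precision-Recall curve and mAP as the mean over classes, averaged over the IoU thresholds. The function names and the input curves are assumptions made for illustration; in practice the curves are produced by the evaluation pipeline.

import numpy as np


def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    # Equation (4): AP = integral of P(R) dR, approximated with the
    # trapezoidal rule over one Precision-Recall curve (fixed IoU threshold,
    # swept confidence threshold).
    order = np.argsort(recall)
    p, r = precision[order], recall[order]
    return float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2.0))


def mean_average_precision(curves: dict) -> float:
    # Equation (5): mAP = mean AP over classes. `curves` maps each class name
    # to a list of (precision, recall) pairs, one per IoU threshold in
    # 0.50:0.05:0.95, matching the COCO-style metric used in this paper.
    per_class = []
    for pr_list in curves.values():
        aps = [average_precision(p, r) for p, r in pr_list]
        per_class.append(np.mean(aps))     # average over IoU thresholds
    return float(np.mean(per_class))       # average over classes


# Usage with tiny made-up curves for two classes at two IoU thresholds.
p = np.array([1.0, 0.8, 0.6])
r = np.array([0.2, 0.5, 0.9])
curves = {"pedestrian": [(p, r), (0.9 * p, r)],
          "car": [(p, 0.8 * r), (p, r)]}
print(round(mean_average_precision(curves), 3))  # 0.509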

4.3. Ablation studies

The VisDrone2019-DET-val dataset is affected by factors such as weather, light, topography and environment, occlusion, and sensor performance, which makes detection difficult. We ran validation on this dataset three times and averaged the accuracy, with results as shown in Table 2. We tested the original YOLOv8-s and YOLOv8-x models as baselines, which achieved mAP scores of 23.2 and 28.1, respectively, at 640-pixel resolution, and then tested fusions of the individual modules. In the YOLOv8 family, the scaling factors "s" and "x" adjust the model size, with "s" yielding a much smaller model; this allows accuracy and efficiency to be balanced according to the user's needs.

Table 2. Test results for individual improvements on the VisDrone2019-DET-val dataset. The scales for YOLOv8 are shown in parentheses. Best results are shown in boldface. Soft-NMS affects only post-processing, without changing the model's parameters or floating-point operations (FLOPs). mAP: mean average precision (IoU = 0.5:0.95). mAP50: mean average precision at IoU = 0.5. FLOPs: floating-point operations. SPD: Space to depth. MultiSEAM: Multi-Separation and Enhancement Attention Module. BiFPN: Bi-directional feature pyramid network. Soft-NMS: soft non-maximum suppression. s and x are the scaling factors of the YOLOv8 model that determine model size.

  1. SPD: By adding the SPD module above each C2f module, the extraction of fine-grained information in the feature map is enhanced, especially under mutual occlusion between detected objects. The training parameters were consistent with those of the baseline model, and the mAP scores were 30.5 and 35.9 for s and x, respectively. The training metrics are visualized in Figure 6(a).

  2. MultiSEAM: By incorporating an adaptive attention module after the output feature maps of the head section, the module extracts and weights features at small, medium, and large scales; fuses features of different scales; retains useful features; and suppresses noise to improve the generalization performance of the model. MultiSEAM is an effective solution to inaccurate detection caused by obstacles such as trees or vegetation obstructing the detected object in UAV images. The training parameters were consistent with the baseline model, and the mAP scores were 27.0 and 32.1 for s and x, respectively. When the module was combined with the BiFPN structure, which enhances the multiscale feature fusion capability of MultiSEAM and makes it more suitable for object detection in complex scenes, detection accuracy improved in UAV images with distant occluded objects, and the mAP scores for s and x rose to 27.5 and 32.4, respectively. The training metrics are visualized in Figure 6(b).

  3. Increasing the number of detection heads: The baseline YOLOv8 model has three detection heads, which detect well across multiple scales, but problems such as poor localization accuracy, limited detection of small targets, and insensitivity to fine target details often remained. We adopted four detection heads, which effectively alleviated these problems. The training parameters were consistent with the baseline model, with mAP scores of 26.1 and 31.9 for s and x, respectively. The training metrics are visualized in Figure 7(a).

  4. Soft-NMS algorithm: Soft-NMS is an effective technique that mitigates the problem of overlooking accurate detections by adaptively modifying the confidence scores of candidate bounding boxes. Because the change affects only post-processing, the original models required only re-validation, not retraining. With Soft-NMS, the baseline YOLOv8-s and YOLOv8-x models achieved mAP scores of 30.7 and 34.9, respectively.

Figure 6. Precision, recall, mAP_0.5 (mAP50), and mAP_0.5:0.95 (mAP) training curves. (a) YOLOv8-s, YOLOv8-x, YOLOv8 + SPD-conv -s, and YOLOv8 + SPD-conv -x models; (b) YOLOv8-s, YOLOv8-x, YOLOv8 + MultiSEAM -s, YOLOv8 + MultiSEAM -x, YOLOv8 + BiFPN + MultiSEAM -s, and YOLOv8 + BiFPN + MultiSEAM -x models. Due to the early-stopping mechanism, training halts if accuracy does not improve within 20 epochs, at about 100 epochs.


Figure 7. Precision, recall, mAP_0.5 (mAP50), and mAP_0.5:0.95 (mAP) training curves. (a) YOLOv8-s, YOLOv8-x, YOLOv8 + DetectHead × 4 -s, and YOLOv8 + DetectHead × 4 -x models; (b) YOLOv8-s, YOLOv8-x, YOLOv8 + BiFPN + SPD + MultiSEAM -s, YOLOv8 + BiFPN + SPD + MultiSEAM -x, YOLOv8 + BiFPN + SPD + MultiSEAM + DetectHead × 4 -s, and YOLOv8 + BiFPN + SPD + MultiSEAM + DetectHead × 4 -x models. Due to the early-stopping mechanism, training halts if there is no improvement within 20 epochs, at about 100 epochs.


With these improvements, the performance of the model on the VisDrone2019-DET-val dataset was enhanced, especially under multiple interfering factors and in complex scenarios. After a series of parameter-tuning experiments, our SDMSEAF-YOLOv8 s and x models achieved mAP scores of 39.4 and 42.9, respectively. The training metrics are visualized in Figure 7(b).

4.4. Comparisons with state-of-the-art models

We conducted a thorough comparison of our experimental results with state-of-the-art models evaluated on the VisDrone2019-DET-val dataset (see Table 3 for details). Our proposed method has two variants, SDMSEA-YOLOv8 and SDMSEAF-YOLOv8, each evaluated at the scaling factors s and x. The variants exhibit different trade-offs in terms of mAP, mAP50, and Img/s. SDMSEA-YOLOv8-s achieved a mAP of 38.2% and a mAP50 of 56.7%, with a processing speed of 14.8 Img/s. A slight enhancement was observed for SDMSEAF-YOLOv8-s, with a mAP of 39.4% and a mAP50 of 58.4%, while the speed increased to 19.4 Img/s. At the x scale, SDMSEA-YOLOv8-x achieved a mAP of 41.4% and a mAP50 of 59.8%, and SDMSEAF-YOLOv8-x achieved the highest mAP, 42.9%, and a mAP50 of 62.0%, but a slower processing speed of 11.2 Img/s, demonstrating a balance between accuracy and efficiency. Our mAP scores were higher than those of the other models, and detection was faster.

Table 3. Comparison of our method with state-of-the-art target detection methods on VisDrone2019-DET-val dataset. Best results indicated in boldface.

5. Limitations and discussion

Overall, our improved SDMSEA-YOLOv8 and SDMSEAF-YOLOv8 significantly enhance detection accuracy through various modifications and supplementary modules. However, these models may require more computational resources and time, and they run more slowly than the baseline YOLOv8. In particular, SDMSEAF-YOLOv8 introduces an additional detection head compared to SDMSEA-YOLOv8; while it achieves higher accuracy, this comes at the expense of slower detection. In relatively bad weather and at night, recognition accuracy for distant objects in drone images decreases. This may be because fine-grained information in the image is degraded by environmental factors, which blurs the edges and features of target objects and thus reduces detection performance.

Figure 8 shows the detection results of the SDMSEAF-YOLOv8 and YOLOv8 baseline models on UAV images, where SDMSEAF-YOLOv8 performs better in daytime, dusk, and evening scenes. In the YOLOv8 baseline predictions, the model tends to underperform because distant perspective occlusion causes detected objects to overlap or be obscured; SDMSEAF-YOLOv8 overcomes this challenge.

Figure 8. Visualization showing SDMSEAF-YOLOv8 and YOLOv8 baseline model detection performance contrasts. (a1), (b1), and (c1) are UAV images during day, dusk, and night, respectively; (a2), (b2), and (c2) are corresponding results of baseline model testing with YOLOv8; (a3), (b3), and (c3) are corresponding SDMSEAF-YOLOv8 model detection results.


6. Conclusions

The SDMSEAF-YOLOv8 model combines BiFPN, SPD, MultiSEAM, an additional detection head, and Soft-NMS to alleviate the problem of distant perspective occlusion of small targets, and it effectively improves the detection performance of UAV images. SDMSEAF-YOLOv8 first performs feature extraction, where the SPD and MultiSEAM structures reduce the loss of detailed information and the imprecision in feature representation. Multiscale feature fusion is then carried out bidirectionally in parallel, combining feature map information across scales, and the additional detection head alleviates the poor detection performance on small target objects. In post-processing, traditional NMS is replaced with the Soft-NMS algorithm to handle overlapping detection boxes more accurately, which improves detection accuracy, especially when there is a large amount of small-target overlap and occlusion in UAV images. In contrast, the SDMSEA-YOLOv8 variant does not incorporate the extra detection head, resulting in faster processing and lower computational resource requirements, albeit with a slight decrease in detection accuracy for small objects. Experiments on the VisDrone2019-DET-val dataset showed the effectiveness of the proposed SDMSEAF-YOLOv8 model in the detection of UAV images, with a mAP 14.8% higher than that of the YOLOv8-x baseline and 6% higher than that of the known state-of-the-art Fine-Grained Target Focusing Network (FiFoNet) model, at roughly twice FiFoNet's detection speed. In the future, we intend to improve the way bounding-box similarity is measured in target detection, helping the model learn better bounding box predictions.

Data availability statement

Our research code can be obtained from https://github.com/FanbinMeng-Group/SDMSEA-YOLOv8.git.

Disclosure statement

The authors declare that there are no conflicts of interest regarding this publication.

Additional information

Funding

This work was partly supported by the Jining Medical University Classroom Teaching Reform Research Project (Grant No. 2022KT012) and the Innovation and Entrepreneurship Training Program for College Students (Grant Nos. 202210443002, 202210443003, and S202310443006).

References

  • Bodla N, Singh B, Chellappa R, Davis LS. 2017. Soft-NMS–improving object detection with one line of code. Paper Presented at the Proceedings of the IEEE International Conference on Computer Vision.
  • Bochkovskiy A, Wang C-Y, Liao H-YM. 2020. Yolov4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934. doi: 10.48550/arXiv.2004.10934.
  • Cai Z, Vasconcelos N. 2019. Cascade R-CNN: high quality object detection and instance segmentation. IEEE Trans Pattern Anal Mach Intell. 43(5):1483–1498. doi: 10.1109/TPAMI.2019.2956516.
  • Deng S, Li S, Xie K, Song W, Liao X, Hao A, Qin H. 2021. A global-local self-adaptive network for drone-view object detection. IEEE Trans Image Process. 30:1556–1569. doi: 10.1109/tip.2020.3045636.
  • Jocher G, Chaurasia A, Qiu J. 2023. YOLO by Ultralytics. https://github.com/ultralytics/ultralytics. Accessed February 30, 2023.
  • Jocher G. 2020. YOLOv5 by Ultralytics. https://github.com/ultralytics/yolov5. Accessed February 30, 2023.
  • Gao J, Chen Y, Wei Y, Li J. 2021. Detection of specific building in remote sensing images using a novel YOLO-S-CIOU model. Case: gas station identification. Sensors (Basel). 21(4):1375. doi: 10.3390/s21041375.
  • Ge Z, Liu S, Wang F, Li Z, Sun J. 2021. Yolox: exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430. doi: 10.48550/arXiv.2107.08430.
  • Girshick R. 2015. Fast r-cnn. Paper Presented at the Proceedings of the IEEE International Conference on Computer Vision.
  • Girshick R, Donahue J, Darrell T, Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. Paper presented at the Proceedings of the IEEE conference on computer vision and pattern recognition. doi: 10.1109/CVPR.2014.81.
  • Guo D, Wang Y, Zhu S, Li X. 2023. A vehicle detection method based on an improved u-yolo network for high-resolution remote-sensing images. Sustainability. 15(13):10397. doi: 10.3390/su151310397.
  • Hu J, Shen L, Sun G. 2018. Squeeze-and-excitation networks. Paper presented at the Proceedings of the IEEE conference on computer vision and pattern recognition. doi: 10.1109/CVPR.2018.00745.
  • Kou X, Liu S, Cheng K, Qian Y. 2021. Development of a YOLO-V3-based model for detecting defects on steel strip surface. Measurement. 182:109454. doi: 10.1016/j.measurement.2021.109454.
  • Li C, Li L, Jiang H, Weng K, Geng Y, Li L, et al. 2022. YOLOv6: a single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976. doi: 10.48550/arXiv.2209.02976.
  • Li C, Yang T, Zhu S, Chen C, Guan S. 2020. Density map guided object detection in aerial images. Paper presented at the proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. doi: 10.1109/CVPRW50498.2020.00103.
  • Li Z, Yuan J, Li G, Wang H, Li X, Li D, Wang X. 2023. Rsi-yolo: object detection method for remote sensing images based on improved yolo. Sensors (Basel). 23(14):6414. doi: 10.3390/s23146414.
  • Li Z, Wang Y, Chen K, Yu Z. 2022. Channel Pruned YOLOv5-based Deep Learning Approach for Rapid and Accurate Outdoor Obstacles Detection. arXiv Preprint. arXiv:2204.13699. doi: 10.48550/arXiv.2204.13699.
  • Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S. 2017. Feature pyramid networks for object detection. Paper Presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. doi: 10.1109/CVPR.2017.106.
  • Lin T-Y, Goyal P, Girshick R, He K, Dollár P. 2017. Focal loss for dense object detection. Paper presented at the Proceedings of the IEEE international conference on computer vision. doi: 10.1109/ICCV.2017.324.
  • Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC. 2016. Ssd: single shot multibox detector. Paper presented at the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. doi: 10.1007/978-3-319-46448-0_2.
  • Liu Z, Gao Y, Du Q. 2023. Yolo-class: detection and classification of aircraft targets in satellite remote sensing images based on yolo-extract. IEEE Access. 11:109179–109188. doi: 10.1109/ACCESS.2023.3321828.
  • Long X, Deng K, Wang G, Zhang Y, Dang Q, Gao Y, et al. 2020. PP-YOLO: an effective and efficient implementation of object detector. arXiv preprint arXiv:2007.12099. doi: 10.48550/arXiv.2007.12099.
  • Luo X, Wu Y, Wang F. 2022. Target detection method of UAV aerial imagery based on improved YOLOv5. Remote Sensing. 14(19):5063. doi: 10.3390/rs14195063.
  • Ma Y, Yu D, Wu T, Wang H. 2019. Paddlepaddle: an open-source deep learning platform from industrial practice. Frontiers of Data and Computing. 1(1):105–115. doi: 10.11871/jfdc.issn.2096.742X.2019.01.011.
  • Pham M-T, Courtrai L, Friguet C, Lefèvre S, Baussard A. 2020. YOLO-Fine: one-stage detector of small objects under various backgrounds in remote sensing images. Remote Sensing. 12(15):2501. doi: 10.3390/rs12152501.
  • R. team. 2023. YOLO-NAS by Deci achieves state-of-the-art performance on object detection using neural architecture search. https://deci.ai/blog/yolo-nas-object-detection-foundation-model/. Accessed May 12, 2023.
  • Redmon J, Divvala S, Girshick R, Farhadi A. 2016. You only look once: unified, real-time object detection. Paper Presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. doi: 10.1109/CVPR.2016.91.
  • Redmon J, Farhadi A. 2017. YOLO9000: better, faster, stronger. Paper presented at the Proceedings of the IEEE conference on computer vision and pattern recognition. doi: 10.1109/CVPR.2017.690.
  • Redmon J, Farhadi A. 2018. Yolov3: an incremental improvement. arXiv Preprint. arXiv:1804.02767. doi: 10.48550/arXiv.1804.02767.
  • Ren S, He K, Girshick R, Sun J. 2015. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 39(6):1137–1149. doi: 10.1109/TPAMI.2016.2577031.
  • Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S. 2019. Generalized intersection over union: a metric and a loss for bounding box regression. Paper presented at the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. doi: 10.1109/CVPR.2019.00075.
  • Sunkara R, Luo T. 2022. No more strided convolutions or pooling: a new CNN building block for low-resolution images and small objects. Paper presented at the Joint European Conference on Machine Learning and Knowledge Discovery in Databases. doi: 10.1007/978-3-031-26409-2_27.
  • Suo J, Wang T, Zhang X, Chen H, Zhou W, Shi W. 2023. HIT-UAV: a high-altitude infrared thermal dataset for Unmanned Aerial Vehicle-based object detection. Sci Data. 10(1):227. doi: 10.1038/s41597-023-02066-6.
  • Su Z, Yu J, Tan H, Wan X, Qi K. 2023. Msa-yolo: a remote sensing object detection model based on multi-scale strip attention. Sensors (Basel). 23(15):6811. doi: 10.3390/s23156811.
  • Tan M, Le Q. 2019. Efficientnet: rethinking model scaling for convolutional neural networks. Paper presented at the International conference on machine learning.
  • Wang C, Bochkovskiy A, Liao H. 2022. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696. doi: 10.1109/CVPR52729.2023.00721.
  • Wang C-Y, Bochkovskiy A, Liao H-YM. 2021. Scaled-yolov4: scaling cross stage partial network. Paper presented at the Proceedings of the IEEE/cvf conference on computer vision and pattern recognition. doi: 10.1109/CVPR46437.2021.01283.
  • Wang J, Xu C, Yang W, Yu L. 2021. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv preprint arXiv:2110.13389. doi: 10.48550/arXiv.2110.13389.
  • Wang W, Dai J, Chen Z, Huang Z, Li Z, Zhu X, Hu X, Lu T, Lu L, Li H. 2023. InternImage: exploring large-scale vision foundation models with deformable convolutions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 14408–14419.
  • Wang Y, Yang Y, Zhao X. 2020. Object detection using clustering algorithm adaptive searching regions in aerial images. Paper presented at the European Conference on Computer Vision. doi: 10.1007/978-3-030-66823-5_39.
  • Wu T, Dong Y. 2023. Yolo-se: improved yolov8 for remote sensing object detection and recognition. Appl Sci. 13(24):12977. doi: 10.3390/app132412977.
  • Xi Y, Jia W, Miao Q, Liu X, Fan X, Li H. 2022. FiFoNet: fine-grained target focusing network for object detection in UAV images. Remote Sens. 14(16):3919. doi: 10.3390/rs14163919.
  • Xu X, Jiang Y, Chen W, Huang Y, Zhang Y, Sun X. 2022. Damo-yolo: a report on real-time object detection design. arXiv preprint arXiv:2211.15444. doi: 10.48550/arXiv.2211.15444.
  • Xu D, Wu Y. 2021. Fe-yolo: a feature enhancement network for remote sensing target detection. Remote Sensing. 13(7):1311. doi: 10.3390/rs13071311.
  • Yang C, Huang Z, Wang N. 2022. QueryDet: cascaded sparse query for accelerating high-resolution small object detection. Paper Presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. doi: 10.1109/CVPR52688.2022.01330.
  • Yang F, Fan H, Chu P, Blasch E, Ling H. 2019. Clustered object detection in aerial images. Paper presented at the Proceedings of the IEEE/CVF international conference on computer vision. doi: 10.1109/ICCV.2019.00840.
  • Yu W, Yang T, Chen C. 2021. Towards resolving the challenge of long-tail distribution in UAV images for object detection. Paper Presented at the Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. doi: 10.1109/WACV48630.2021.00330.
  • Yu W, Zhou P, Yan S, Wang X. 2023. InceptionNeXt: when Inception meets ConvNeXt. arXiv preprint arXiv:2303.16900. doi: 10.48550/arXiv.2303.16900.
  • Yu Z, Huang H, Chen W, Su Y, Liu Y, Wang X. 2022. Yolo-facev2: a scale and occlusion aware face detector. arXiv preprint arXiv:2208.02019. doi: 10.48550/arXiv.2208.02019.
  • Yu Z, Liu Y, Yu S, Wang R, Song Z, Yan Y, Li F, Wang Z, Tian F. 2022. Automatic detection method of dairy cow feeding behaviour based on YOLO improved model and edge computing. Sensors (Basel). 22(9):3271. doi: 10.3390/s22093271.
  • Zhang H, Hao C, Song W, Jiang B, Li B. 2023. Adaptive Slicing-Aided Hyper Inference for Small Object Detection in High-Resolution Remote Sensing Images. Remote Sensing. 15(5):1249. doi: 10.3390/rs15051249.
  • Zhu L, Wang X, Ke Z, Zhang W, Lau RW. 2023. BiFormer: vision transformer with bi-level routing attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 10323–10333.
  • Zhu P, Wen L, Du D, Bian X, Fan H, Hu Q, Ling H. 2021. Detection and tracking meet drones challenge. IEEE Trans Pattern Anal Mach Intell. 44(11):7380–7399. doi: 10.1109/TPAMI.2021.3119563.
  • Zhu X, Lyu S, Wang X, Zhao Q. 2021. TPH-YOLOv5: improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. Paper presented at the Proceedings of the IEEE/CVF international conference on computer vision. doi: 10.1109/ICCVW54120.2021.00312.