
FastPFM: a multi-scale ship detection algorithm for complex scenes based on SAR images

Article: 2313854 | Received 27 Jul 2023, Accepted 30 Jan 2024, Published online: 19 Feb 2024

Abstract

Synthetic Aperture Radar (SAR) is renowned for its all-weather capability, exceptional penetration, and high-resolution imaging, making SAR-based ship detection crucial for maritime surveillance and sea rescue operations. However, challenges such as blurred ship contours, complex backgrounds, and uneven scale distribution impede improvements in detection performance. In this study, we propose FastPFM, a novel ship detection model developed to address these challenges. Firstly, we utilise FasterNet as the backbone network to reduce computational redundancy, enhancing feature extraction efficiency and overall computational performance. Additionally, we employ the Feature Bi-level Routing Transformation Module (FBM) to obtain global feature information and strengthen the focus on target regions. Secondly, the PFM module is engineered to collect multi-scale target information effectively by establishing connections across stages, thereby improving the fusion of target features. Thirdly, an extra target feature fusion layer is introduced to improve small-ship detection precision and accommodate multi-scale targets. Finally, comprehensive tests on the HRSID and SSDD datasets validate FastPFM's efficacy. Compared to the baseline model YOLOX, FastPFM improves detection accuracy by 5.5% and 4.4%, respectively. Furthermore, FastPFM demonstrates comparable or superior performance to other detection algorithms, achieving 92.1% and 83.1% on AP50, respectively.

1. Introduction

Advances in satellite imagery technology have made it easier to obtain high-resolution, wide-area photographs from space, benefiting sectors such as environmental monitoring, traffic management, and natural disaster identification. They have also opened up promising opportunities for ship detection. In ocean surveillance, ship detection is essential because it enables tasks such as ship identification, tracking, and monitoring, while also helping to ensure maritime safety and facilitate rescue operations. These endeavours are paramount in safeguarding marine security, managing marine resources, and preserving the integrity of our oceans.

Traditional ship target detection methods (Leng et al., Citation2015; Liu et al., Citation2019) predominantly rely on Constant False Alarm Rate (CFAR) algorithms. Because these algorithms depend on human expertise and clutter statistics, they frequently struggle with complex coastlines and marine clutter, which reduces their accuracy and adaptability. As a result, they suffer from low detection precision and poor generalisation in complicated scenarios, such as densely packed ships along the coast, buildings, beaches, tiny islands, and sea clutter, all of which can degrade ship detection accuracy. Moreover, their performance may be limited under adverse conditions such as dense fog or nighttime. A more accurate and efficient way of identifying ship targets is therefore needed to maintain the efficacy and security of maritime trade.

Ship detection has advanced to a new level thanks to the development of deep learning (DL) and convolutional neural networks (CNNs), which have addressed the shortcomings of earlier techniques. CNN-based ship detection offers stronger feature learning, better generalisation, better adaptability, and higher detection accuracy. SAR technology exploits the synthetic aperture effect of the radar beam to obtain high-resolution images; in addition, it provides all-weather imaging that is almost independent of weather conditions (e.g. clouds, haze) and lighting conditions (e.g. night) (Li et al., Citation2021; Zhang & Hao, Citation2022). Deep learning-based SAR ship detection techniques have arisen as a result, bringing significant advantages for ship monitoring. Currently, one-stage ship detection (Tang et al., Citation2021) and two-stage ship detection methods (Lin et al., Citation2019; Zhao et al., Citation2020) are the two main research directions of DL-based SAR ship detection algorithms. Two-stage methods take a two-step strategy: generating candidate regions first, then classifying and regressing positions on those regions. While this approach achieves high detection accuracy, it is slow and computationally expensive, contradicting the lightweight concept and the real-time requirements of vessel detection. One-stage algorithms directly predict the position and class of targets from the input image, offering simplicity and speed, which makes them appropriate for real-time ship recognition during navigation. Thus, we adopt the one-stage detection approach in this study.

Unfortunately, while SAR imagery brings advances and convenience to ship detection, it also presents several challenges. Firstly, sea waves and other disturbances frequently affect ship targets in SAR images, producing hazy borders that make recognition and extraction difficult. Secondly, the water usually appears dark in SAR images, whereas ships are typically bright white; however, nearby islands, structures, and surrounding noise may display comparably bright returns, greatly increasing the difficulty of ship recognition (Zhao et al., Citation2023). Additionally, ships in motion can undergo significant geometric transformations in SAR images, resulting in blurred ship images and irregular silhouettes (Wang et al., Citation2023a, Citation2023b). Lastly, the shape and size of ship targets vary, and small ships dominate, producing an imbalanced distribution of ship scales that further increases the difficulty of target detection.

Because SAR ship images exhibit the unfavourable factors mentioned above, they pose a significant challenge to ship detection technology. First, since ship targets in SAR images are hazy and susceptible to interference, detailed features are easily lost during detection. The FPN (Feature Pyramid Network) fuses features from different levels via up-sampling and down-sampling, so the fused features may lose detail or contain noise, ultimately reducing detection performance. Secondly, because small and medium-sized ships make up the bulk of SAR ship images, tiny ships often appear as no more than a bright point in real-world circumstances. Ships along the shoreline or in narrow channels are distributed more densely and are affected by coastal buildings; detecting dense small ships and differentiating ships from shoreline buildings is therefore challenging. Finally, many existing models rely on feature extraction and feature fusion networks that are computationally intensive and lack robustness, typically built by repeatedly stacking fixed up- and down-sampling operations. This hampers the model's overall capacity for feature extraction and fusion, lowers its generalisation ability, and raises its computational demands, which is inadequate for real-time detection.

To address the above challenges, we introduce FastPFM, a ship detection method built upon the YOLOX (You Only Look Once X) (Ge et al., Citation2021) algorithm and targeted mainly at complex scenes and multi-scale scenarios. The method strengthens the model's multi-scale fusion of ship features, improving the recognition of ships in SAR image datasets and gathering richer contextual information. It also explicitly enhances the model's feature extraction capability for tiny ships, alleviating the low detection accuracy caused by the dataset's uneven distribution of ship sizes.

The following is a summary of this paper's significant contributions:

  • We proposed a YOLOX-based ship detection method, FastPFM, which adopts FasterNet as its foundation network to increase computational efficiency and detection accuracy and to better exploit the redundancy in feature maps. Additionally, we designed the FBM (Feature Bi-level Routing Transformation Module), which divides high-level feature information into two parts in the backbone network, processes it using BRAttention, and then merges the parts through a cross-stage structure. This design enables better capture of global features, achieving higher precision in ship detection.

  • We designed a PFM (PConv Feature Module) that uses PConv convolution to reduce the model's computational burden and memory accesses. Furthermore, we employ a cross-stage connection structure to effectively gather the targets' multi-scale and semantic characteristics, improving the accuracy and robustness of object detection.

  • We conducted model performance and generalisation tests against the baseline YOLOX and other state-of-the-art (SOTA) methods on the benchmark datasets HRSID (Wei et al., Citation2020) and SSDD (Zhang, Zhang, Li, et al., Citation2021). The experimental results demonstrate a significant improvement in detection accuracy with the FastPFM model compared to YOLOX. Furthermore, FastPFM performs better overall than the other SOTA methods.

The remainder of this article is organised as follows: Part II provides an overview of studies relevant to our work; Part III presents the proposed method; Part IV discusses the experimental results and analysis; and Part V concludes with a summary of our work and future plans.

2. Related work

To give readers a better understanding of current breakthroughs in the field, this section introduces several traditional and DL-based SAR ship detection algorithms that played an important role in shaping our research.

2.1. Traditional ship detection algorithm

Conventional SAR ship target recognition algorithms rely heavily on CFAR (Steenson, Citation1968), which dynamically adjusts thresholds based on variations in background noise, ensuring a consistent false alarm rate under different conditions. Although it effectively suppresses background clutter and thereby improves target detection accuracy, the CFAR algorithm is vulnerable to complicated coastlines, sea waves, and speckle noise because it relies on expert knowledge and handcrafted features, which can lead to lower detection accuracy and limited generalisation. Moreover, the effectiveness of handcrafted features in all situations is difficult to ensure, limiting overall robustness.

Many researchers have worked on enhancing the detection performance of the CFAR algorithm and reducing its sensitivity to complex shorelines. Because the probabilistic neural network (PNN) model still needed improvement in recognition precision and speed on SAR images, Du et al. (Du et al., Citation2008) processed 8-bit and 16-bit SAR images, proposed a C-PNN model, and used a new CFAR algorithm that outperformed traditional PNN-based ship detection. Furthermore, because ship recognition techniques usually rely on high-resolution SAR images that are affected by background clutter, many ship detectors employ CFAR algorithms. Migliaccio et al. (Migliaccio et al., Citation2008) suggested a ship detection filtering approach using SLC images that can successfully cope with scattering patches while maintaining a low false alarm rate. Traditional ship detection algorithms cannot recognise near-shore ships successfully because of the great similarity between ports and ships in grey-level and texture properties; Zhai et al. (Zhai et al., Citation2016) recommended combining saliency and contextual information to solve this problem, proposing a ship identification framework that reduces false alarms and hence improves overall detection performance. Overall, these studies aim to enhance the accuracy and effectiveness of SAR ship identification and to address the numerous problems encountered.

Despite their advantages, traditional ship detection algorithms still have several drawbacks, such as limited versatility, poor adaptability to varied settings, and an inability to satisfy the needs of real-time detection. Furthermore, typical ship detection methods often struggle to distinguish ships effectively in multi-scale and complicated settings. Therefore, deep learning-based SAR ship detection algorithms are becoming dominant, gradually surpassing these traditional methods (Zheng, Li, et al., Citation2023).

2.2. Deep learning based SAR ship detection algorithm

In recent years, DL-based techniques for synthetic-aperture radar (SAR) images have been developed to overcome the constraints of classic ship detection systems; they can be divided into two categories: one-stage and two-stage ship detection algorithms. Deep learning currently makes its strides with enormous amounts of labelled data. However, the shortage of large volumes of labelled data limits the broad application of deep learning, making the development of semi-supervised ship detection algorithms critical (Rai et al., Citation2022). In the past few years, an increasing number of researchers have applied semi-supervised learning frameworks to SAR ship datasets to increase detection accuracy (Wang, Shi, et al., Citation2021; Zhou, Jiang, et al., Citation2023). Several studies also apply deep learning-driven ship identification approaches to ship trajectory prediction and container loading (Qian et al., Citation2022; Zong & Wan, Citation2022).

Despite significant advances in applying deep learning to ship detection in SAR imagery, numerous hurdles and limitations remain. First, owing to the unique nature of SAR images, such as background clutter and target complexity, building effective feature extractors and classifiers remains difficult. Second, while semi-supervised learning helps ameliorate the problem of insufficient labelled data, how to use unlabelled data efficiently remains an open question. Furthermore, applying deep learning to domains like ship trajectory prediction and container loading necessitates additional research and exploration. Future research can concentrate on resolving these issues to increase the performance and dependability of ship detection and related applications.

2.2.1. Deep learning based two-stage SAR ship detection algorithm

The emergence of two-stage object detection methods has proven beneficial in addressing the limits of traditional methods. These methods enhance the performance and robustness of ship recognition by exploiting deep learning's powerful feature learning, multi-scale detection capability, and efficient NMS (Non-Maximum Suppression). Prominent two-stage object detection algorithms include Fast R-CNN (Girshick, Citation2015), Faster R-CNN (Ren et al., Citation2015), and Cascade R-CNN (Cai & Vasconcelos, Citation2019). Because traditional methods require segmenting the coast before detection and do so inaccurately, resulting in poor detection performance, Lin et al. (Lin et al., Citation2019) put forward a Faster R-CNN-based structure that uses squeeze-and-excitation techniques to improve ship detection efficiency. Furthermore, because of varying ship sizes and diverse backdrops in SAR images, recognising multi-scale ships in complex surroundings remains difficult. Zhao et al. (Zhao et al., Citation2020) therefore proposed an Attention-Receptive Pyramid Network (ARPN) combining the Receptive Field Block (RFB) and the Convolutional Block Attention Module (CBAM), which improves the recognition of multi-scale ships against complicated SAR backgrounds by enhancing vital information and suppressing environmental interference. Zhang et al. (Zhang et al., Citation2021) proposed a novel Quad-Feature Pyramid Network (Quad-FPN) for SAR ship detection, made up of four distinct FPNs; fusing features from multiple FPNs allows the model to retain more useful ship shape information while reducing the influence of complex backgrounds. Because scattering noise in SAR images degrades feature learning and impedes the understanding of semantic features for target detection, Zhang et al. (X. Zhang et al., Citation2021) proposed a multi-task learning-based target detector (MTL-Det) that models SAR ship identification as three collaborative tasks, learning more discriminative target features through multi-task learning.

Despite the benefits of precision, strong feature learning capabilities, and robust multi-scale detection abilities, two-stage object detection methods are relatively complex, slower, and have limited capabilities in detecting small objects. Hence, careful consideration of practical requirements and scenarios is essential when selecting a ship detection algorithm.

2.2.2. Deep learning-based one-stage SAR ship detection algorithm

The main one-stage ship detection systems include the YOLO series, SSD (Liu et al., Citation2016), FCOS (Tian et al., Citation2019), and RetinaNet (Lin, Goyal, et al., Citation2017). These approaches detect and locate objects directly, offering lower computational complexity and faster detection. For this reason, they are well suited to real-time ship detection and frequently employed in recent works. However, ship detection is somewhat hampered by the unbalanced ship scale distribution in the SSDD and HRSID SAR datasets.

Researchers have suggested improvements to address the scale imbalance issue. The Feature Pyramid Network (FPN) introduced by Lin et al. (Lin, Dollár, et al., Citation2017) tackles it by fusing multi-scale ship features across feature maps of various scales, improving the model's ability to represent ship targets and handle multi-scale ships. Several scholars have since refined FPN for multi-scale ship detection. Zhu et al. introduced DB-FPN based on YOLOv5 (Zhu et al., Citation2021), which enhances the merging of semantic and spatial location information through a recurrent structure; however, its detection capability is constrained in complicated scenarios. To enhance detection in such scenarios, Zhou et al. (Zhou, Fu, et al., Citation2023) introduced D-MFPN, which refines the expressiveness of feature maps by integrating a multi-layer feature pyramid network with attention modules. Sun et al. (Sun, Leng, et al., Citation2021) introduced Bi-DFFM within the YOLO framework, which combines multi-scale ship feature information through both bottom-up and top-down routes, improving the recognition of multi-scale ships. Yasir et al. (Yasir et al., Citation2023) increased the accuracy of YOLOv5 for multi-scale ship detection by leveraging C3 modules, an attention mechanism, and the PAFPN structure. Since the great majority of ships in SAR datasets are small, a model's overall detection capability is largely determined by its accuracy on small targets. To address the small target detection problem, Chen et al. (J. Chen et al., Citation2022) suggested a depth-difference-based DFM that maps the various difference regions onto two-dimensional candidate regions according to distance.

Beyond the issue of imbalanced ship scales, the sparsity of ships at sea means that most anchor boxes are unnecessary, which can prolong the model's computation. Fu et al. (Fu et al., Citation2020) proposed the Feature Balancing and Refinement Network (FBR-Net), which employs a generalised anchor-free strategy with an attention-guided balanced pyramid. By improving semantic information and refining object features through a feature refinement module, the approach can recognise multi-scale ships. In addition, several researchers have worked on improving classification and regression algorithms and on enhancing detection speed and practical applicability. Gao et al. presented an anchor-free dense attention feature aggregation network for detecting ships in SAR images (Gao et al., Citation2020). To improve regression in the FCOS network, Sun et al. updated the object categorisation and bounding box regression techniques (Sun, Dai, et al., Citation2021); their proposed CP module performed well at detecting ships in various high-resolution SAR image datasets.

While many existing algorithms concentrate on detection accuracy, they frequently overlook detection speed and the technology's practical application. Most are unsuitable for real-world deployment because they rely on powerful desktop GPUs. The YOLO-V4-Light network was suggested by Jiang et al. (Jiang et al., Citation2021), which uses a multi-channel fusion SAR image processing technique to overcome these problems. Zheng et al. (Zheng, Zhang, et al., Citation2023) proposed the MCYOLOv5 ship detection algorithm as an improvement over existing algorithms: it extracts features with the lightweight MobileNetV3 network and creates an effective CNeB feature fusion module, further decreasing model complexity while enhancing the model's ability to extract and fuse features. Zheng et al. (Zheng et al., Citation2022) replaced the YOLOv4 feature extraction module with MobileNetV1, significantly reducing computation time while maintaining recognition accuracy.

Notwithstanding the advantages of one-stage ship detection algorithms, such as speed, low computational complexity, and strong adaptability to small-scale objects, they suffer from relatively low detection accuracy and a weak ability to extract contextual target details, resulting in inadequate detection in complicated backgrounds. To enhance the model's ability to extract global characteristics while balancing computational effort, we propose the FBM module.

3. Methods

In this section, we first present YOLOX, our baseline model. Next, we show the general architecture of our proposed model. Lastly, we discuss FasterNet, FBM, PFM, and the PFM-based multi-scale feature fusion pyramid.

3.1. Baseline model YOLOX

Considering the target scattering, the sparsity of ships, and the small-sample characteristics of SAR images, the anchor-free mechanism, decoupled heads, and dynamic sample allocation strategy of YOLOX (Ge et al., Citation2021) make it an ideal baseline. Moreover, YOLOX frequently serves as the baseline in advanced ship detection methods (Peng & Tan, Citation2022; J. Zhang et al., Citation2023). We therefore use YOLOX as the foundational model in this study.

The YOLOX network topology is shown in Figure 1 and is composed of three modules: the Head module for object identification, the Neck module for feature processing, and the Backbone module for feature extraction. YOLOX typically receives 640 × 640 × 3 three-channel RGB images and outputs bounding boxes for the detected objects together with their category information. Furthermore, it employs data augmentation methods such as Mixup and Mosaic to enrich the detected objects' visual backgrounds and mitigate the impact of batch size on the model.

Figure 1. Structure of the YOLOX network model. (BackBone: feature extraction network; PAFPN: feature fusion network; Head: target detection head; the rest are structures used in these three modules)


Backbone. Darknet53, which comprises CBS, CSP-X, and SPP modules, is the foundation of YOLOX. The CBS module consists of a convolutional layer, a BN layer, and the SiLU activation function. The CSP-X module combines CBS modules, ResUnit modules, and concatenation operations; the ResUnit module consists of CBS modules and addition operations. The SPP module transforms features of arbitrary sizes into fixed-size feature vectors.
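As a concrete illustration, the CBS block can be sketched in a few lines of PyTorch. This is a minimal sketch in our own notation; the class name and default kernel size are chosen for illustration rather than taken from the YOLOX source.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv -> BatchNorm -> SiLU, the basic building block described above."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))
```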

Neck. YOLOX uses the PAFPN (Path Aggregation Feature Pyramid Network) topology for its neck. This structure accounts for the ambiguity of low-level target information and builds high-level semantic features at various scales through top-down lateral connections, while introducing bottom-up paths to compensate for and reinforce localisation information. PAFPN integrates features from the S2, S3, and S4 layers output by the backbone network. Typically, shallow layers contain abundant ship-detail features, advantageous for detecting small vessels, while deeper layers possess rich semantic information, favouring the detection of large ships. The PAFPN feature fusion network effectively integrates both shallow and deep features, enabling the network to fully extract hierarchical features and thereby obtain richer feature information, including strong semantics, edges, and textures. However, since most ships in SAR images appear as small targets, PAFPN's omission of detailed features from the S1 layer may hurt the detection of small vessels.

Head. YOLOX introduced a decoupled detection head with several branches to resolve the conflict between the classification and regression tasks. Each head first uses a 1 × 1 Conv to reduce the feature channels to 256 for the classification and regression paths; two parallel branches with two 3 × 3 Conv layers each are then added, and the regression branch additionally carries an IoU branch. Another innovation of YOLOX is dynamic sample matching in place of an anchor-based approach.
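A hedged sketch of such a decoupled head is shown below: a 1 × 1 Conv reduces the channels to 256, two parallel branches of two 3 × 3 Conv layers handle classification and regression, and an extra IoU branch sits on the regression path. The layer names and the SiLU activations are our assumptions.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_ch: int, num_classes: int, width: int = 256):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, width, 1)  # 1 x 1 channel reduction to 256

        def branch() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(width, width, 3, padding=1), nn.SiLU(inplace=True),
                nn.Conv2d(width, width, 3, padding=1), nn.SiLU(inplace=True),
            )

        self.cls_branch, self.reg_branch = branch(), branch()
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # class scores
        self.reg_pred = nn.Conv2d(width, 4, 1)            # box offsets
        self.iou_pred = nn.Conv2d(width, 1, 1)            # IoU branch on regression path

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        c, r = self.cls_branch(x), self.reg_branch(x)
        return self.cls_pred(c), self.reg_pred(r), self.iou_pred(r)
```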

3.2. Our proposed FastPFM network structure

Our proposed FastPFM ship detection network is built around the standard YOLOX architecture and is composed of Backbone, Neck, and Head modules, as illustrated in Figure 2. First, FastPFM preprocesses the input image by adjusting its resolution to 640 × 640 pixels. Next, it normalises the image to aid model computation. Lastly, it uses data augmentation techniques such as Mixup, which blends pairs of images with different coefficients, helping the model learn target features and enhancing its robustness and generalisation. However, downsampling to the model's required resolution during preprocessing may cause some detailed features to be lost or blurred, especially for small ships, which may degrade their detection. To counter this, our PFM-based multi-scale feature fusion network retains more detailed information during detection, yielding better semantic and spatial data and improving the network's overall detection performance. Furthermore, resizing the images to a uniform size simplifies processing and lowers the computational load, speeding up training and inference.
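For reference, the preprocessing steps described above (resizing, normalisation, and Mixup blending) might be sketched as follows. The normalisation scheme and the Beta distribution parameter are illustrative assumptions, not FastPFM's exact settings.

```python
import numpy as np
import cv2  # assumed available for resizing

def preprocess(img: np.ndarray, size: int = 640) -> np.ndarray:
    """Resize to the model's input resolution and scale pixels to [0, 1]."""
    img = cv2.resize(img, (size, size)).astype(np.float32)
    return img / 255.0

def mixup(img_a: np.ndarray, img_b: np.ndarray, alpha: float = 1.0):
    """Blend two preprocessed images with a Beta-sampled coefficient."""
    lam = float(np.random.beta(alpha, alpha))
    return lam * img_a + (1.0 - lam) * img_b, lam
```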

Figure 2. The FastPFM network's overall framework. (BackBone: feature extraction network; Neck: feature fusion network; Head: target detection head; the rest are structures used in these three modules)


Specifically, compared to the standard YOLOX structure, FastPFM includes the following improvements:

  1. We replace the YOLOX backbone network Darknet53 with FasterNet (J. Chen et al., Citation2023) to address the issue of low computational efficiency. Moreover, Darknet53 only outputs feature maps from S2, S3, and S4 layers, lacking shallow detailed features, which results in poor performance in detecting small-sized ships. To tackle this, we add an S1 layer as an output layer in the backbone network, providing the necessary shallow detailed features to enhance the identification of tiny vessels.

  2. FBM is designed to obtain richer global and contextual information while alleviating the loss of detailed features due to multiple downsampling.

  3. Considering the presence of numerous small-sized ship targets in the HRSID dataset, we feed the features extracted by the backbone's S1 layer into the Neck module. This extension enables the Neck module to capture detailed features of smaller targets effectively.

  4. To increase feature fusion and broaden the network's receptive area, the PFM module is integrated into the Neck module. Additionally, the original CBS (Conv, BN, SiLU) module in YOLOX is redesigned using PConv, which improves the model's computational efficiency.

3.3. Our proposed FastPFM backbone network

3.3.1. FasterNet feature extraction network

The backbone of the original YOLOX, Darknet53, uses standard Conv convolutions, which are not only slow but also incur many floating-point operations (FLOPs), increasing the model's latency. Equation (1) illustrates the relationship between latency, FLOPs, and FLOPS, where FLOPs denotes the number of floating-point operations and FLOPS the floating-point operations per second; the computation's delay is FLOPs divided by FLOPS:

(1) Latency = FLOPs / FLOPS

The feature maps produced by YOLOX's Conv convolutions are highly redundant across channels, necessitating additional convolution operations to refine the redundant features; as a result, computational efficiency drops while cost and latency rise significantly. In contrast, the FasterNet used in this work adopts PConv in place of Conv, performing convolution on only a subset of input channels while leaving the remaining channels intact. FLOPs are thus reduced while FLOPS increases, which lowers latency and improves computational speed and accuracy. Figure 3(a) illustrates the implementation flow of PConv. To ensure contiguous memory access, the convolution is computed over the first c_p consecutive channels. Equation (2) gives the FLOPs of PConv:

(2) h × w × k² × c_p²

When c_p is 1/4 of the input channels, PConv's FLOPs are only 1/16 of those of a standard Conv2D, and its memory accesses only 1/4. The memory access count of PConv is given in Equation (3):

(3) h × w × 2c_p + k² × c_p² ≈ h × w × 2c_p
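A minimal PyTorch sketch of PConv as described above follows: a regular k × k convolution is applied only to the first c_p contiguous channels (using the 1/4 ratio from the text), while the remaining channels pass through untouched.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Convolve only the first c_p channels; leave the rest untouched."""
    def __init__(self, channels: int, k: int = 3, ratio: float = 0.25):
        super().__init__()
        self.cp = max(1, int(channels * ratio))  # channels actually convolved
        self.conv = nn.Conv2d(self.cp, self.cp, k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.cp, x.size(1) - self.cp], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)  # untouched channels pass through
```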

Figure 3. FasterBlock structure diagram.


As shown in Figure 3, the FasterNet backbone consists of four stages of FasterBlock layers, each stage preceded by an additional layer (an Embedding layer for the first stage, Merging layers for the rest) that downsamples or expands the channels of the feature maps. This study mainly uses the FasterNet-M network, whose four stages contain 3, 4, 18, and 3 FasterBlock layers, respectively. Figure 3(b) illustrates the structure of a FasterBlock, composed of a PConv layer and two 1 × 1 Conv layers; BatchNorm (BN) and the SiLU activation function between the two Conv layers enhance the network's generalisation ability and training efficiency. Replacing the YOLOX backbone with FasterNet maintains good feature extraction while reducing redundant computation and memory accesses.
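Assuming the PConv sketch above, a FasterBlock can be sketched as a PConv followed by two 1 × 1 convolutions with BN and SiLU between them; the expansion factor and the residual connection follow the general FasterNet design and are assumptions here.

```python
import torch
import torch.nn as nn  # PConv is assumed from the sketch earlier in this subsection

class FasterBlock(nn.Module):
    """PConv followed by two 1 x 1 convolutions, with BN and SiLU in between."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PConv(channels)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.pconv(x))  # residual connection, as in FasterNet
```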

Ship detection networks frequently use lower-level feature maps, which have narrow receptive fields, to detect small ships; their high resolution captures rich, detailed information. Higher-level feature maps, with greater global and semantic information and bigger receptive fields, are used to detect larger ships despite their lower resolution. However, the standard YOLOX backbone has a limitation in feature fusion: it primarily fuses features from the S2, S3, and S4 layers, overlooking the detailed features of shallower feature maps, which leads to unsatisfactory performance on small ships. Although the shallow S2 layer in YOLOX provides valuable detail and clear resolution, its limited receptive field and weak semantics may cause small ships to be misclassified as background, impeding overall detection performance.

Due to the high proportions of small-sized vessels in some datasets, such as HRSID, the conventional YOLOX is ineffective at efficiently detecting these small ships. As a result, we integrate the shallow-level feature map from the FasterNet backbone network's S1 layer as an extra output layer into the feature fusion network, greatly enhancing the network's capacity to identify small ships.

3.3.2. Feature bi-level routing transformation module

Specifically, when the FasterNet backbone extracts local features from SAR images of small ships, the downsampling effect may cause some small ships to go undetected. Moreover, lower-level feature maps suffer from a limited receptive field, leading to inadequate capture of global information within convolutional neural networks. To address these issues and improve the model's feature extraction for multi-scale ships while enhancing the precision of tiny-ship detection, we propose the Feature Bi-level Routing Transformation Module (FBM). This module efficiently gathers extensive global and contextual information without appreciably raising computational complexity, ultimately increasing the precision and generalisability of the detection results.

Figure 4 depicts the overall architecture of FBM, whose core is the Bi-Level Routing Attention (BRAttention) (Zhu et al., Citation2023). This attention mechanism has two phases. In the first phase, the image is divided into coarse-grained regions containing several pixels each. The correlation between every pair of coarse-grained regions is computed from the queries (Q) and keys (K), producing a correlation matrix. This matrix then undergoes sparse processing: only the top-k elements with the highest values are retained, while the rest are set to zero. This step focuses attention on the most essential information in specific regions.

Figure 4. FBM module. (BRAttention: Bi-Level routing attention mechanism)


The coarse-grained sparse matrix obtained from the first stage is used for fine-grained self-attention calculation in the second stage. Each patch only attends to patches within the same coarse-grained region as itself, enabling more detailed and localised attention computation.

The specific implementation details of BRAttention are as follows:

  1. Region division and input mapping: given an input feature map X ∈ R^{H×W×C}, divide it into S × S regions so that each region contains HW/S² feature vectors, reshaping X into X^r ∈ R^{S²×(HW/S²)×C}. Then Q, K, V are obtained as shown in Equation (4):

(4) Q = X^r W^q, K = X^r W^k, V = X^r W^v

where W^q, W^k, W^v ∈ R^{C×C} are the projection weights of Q, K, and V.
  2. Construct the index matrix between regions: first obtain the region-level Q^r, K^r ∈ R^{S²×C} by averaging Q and K over each region, then calculate A^r ∈ R^{S²×S²} by Equation (5):

(5) A^r = Q^r (K^r)^T

Each entry of the matrix A^r represents the degree of semantic association between two regions. Then, to create an index matrix I^r ∈ N^{S²×k}, the A^r matrix is pruned by keeping only the top-k connections for each region, as shown in Equation (6):

(6) I^r = topkIndex(A^r)
  3. Fine-grained attention calculation: since the regions indexed by I^r are scattered over the whole feature map, the corresponding key-value pairs must first be gathered, as shown in Equation (7):

(7) K^g = gather(K, I^r), V^g = gather(V, I^r), where K^g, V^g ∈ R^{S²×(kHW/S²)×C}

Finally, attention is calculated on the gathered key-value pairs, as shown in Equation (8):

(8) O = Attention(Q, K^g, V^g) + LCE(V)

where LCE(V) is the local context enhancement module described in the literature (C. Chen et al., Citation2022; Sun, Leng, et al., Citation2021).
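To make the pipeline of Equations (4)-(8) concrete, the following is a hedged PyTorch sketch of BRAttention: regions are formed, region-level affinities are pruned to the top-k, the selected key-value tokens are gathered, and fine-grained attention runs over them. The region count, top-k value, and the depthwise-convolution stub for LCE(V) are illustrative assumptions, not the authors' exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BRAttention(nn.Module):
    def __init__(self, dim: int, s: int = 7, topk: int = 4):
        super().__init__()
        self.s, self.topk = s, topk
        self.wq, self.wk, self.wv = (nn.Linear(dim, dim) for _ in range(3))
        self.lce = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)  # LCE(V) stub

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s, k_sel = self.s, self.topk
        rh, rw = h // s, w // s          # region size (H, W assumed divisible by S)
        n = rh * rw                      # tokens per region
        # Eq. (4): partition into S^2 regions and project to Q, K, V.
        xr = x.view(b, c, s, rh, s, rw).permute(0, 2, 4, 3, 5, 1).reshape(b, s * s, n, c)
        q, k, v = self.wq(xr), self.wk(xr), self.wv(xr)
        # Eq. (5): region-averaged Q^r, K^r and affinity A^r = Q^r (K^r)^T.
        ar = q.mean(2) @ k.mean(2).transpose(-1, -2)          # (b, S^2, S^2)
        # Eq. (6): index matrix I^r keeps the top-k related regions per region.
        idx = ar.topk(k_sel, dim=-1).indices                  # (b, S^2, k)
        # Eq. (7): gather the key/value tokens of the selected regions.
        idx_e = idx[:, :, :, None, None].expand(b, s * s, k_sel, n, c)
        kg = torch.gather(k.unsqueeze(1).expand(b, s * s, s * s, n, c), 2, idx_e)
        vg = torch.gather(v.unsqueeze(1).expand(b, s * s, s * s, n, c), 2, idx_e)
        kg = kg.reshape(b, s * s, k_sel * n, c)
        vg = vg.reshape(b, s * s, k_sel * n, c)
        # Eq. (8): fine-grained attention over the gathered pairs, plus LCE(V).
        attn = F.softmax(q @ kg.transpose(-1, -2) / c ** 0.5, dim=-1)
        out = (attn @ vg).reshape(b, s, s, rh, rw, c).permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        v_sp = v.reshape(b, s, s, rh, rw, c).permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return out + self.lce(v_sp)
```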

Finally, the FBM module uses two branches to extract the input features. One branch retains the original features, which helps to maintain certain details and local characteristics of the original input data to avoid losing such information during the feature fusion process. The other branch passes through the BRAttention based on sparse sampling. On the one hand, this operation retains fine-grained detail information and enhances the fusion of context between individual regions, thus enhancing the model's capability for multi-scale ship detection. On the other hand, the BRAttention operation involves GPU-friendly dense matrix multiplication and is based on sparse sampling, which is conducive to improving computational efficiency.

By adding FBM after the S4 layer of FasterNet, the model's expressiveness and generalisation abilities are enhanced. It can also learn the correlation of particular regions or features, lessen the impact of ship detection in complex scenes, and suppress noise and extraneous information, like clutter in the ocean. Furthermore, the contextual data can be connected to improve the correlation between ships through continuous learning, which will improve the network's overall detection capabilities and the ability to recognise multi-scale ships.

3.4. PFM-based multi-scale feature pyramids

The YOLOX feature fusion network, PAFPN, effectively combines features from several pyramid levels through a fine-grained feature path design, improving target detection speed. However, this efficient fusion poses challenges during training due to its complexity.

A SAR image typically contains few ships, so target samples are sparse, and the ships themselves span multiple scales. To improve the network's detection of tiny targets, this study feeds the features derived from the FasterNet backbone's S1 layer into the PAFPN and constructs an additional layer to fuse them. This design, however, requires extra computation. We therefore introduce the PFM (PConv Feature Module) to decrease the computation while enhancing detection performance.

3.4.1. Partial convolution-based PFM module

Designing an effective model structure is critical for capturing multi-scale information, handling unbalanced data, and improving performance in multi-scale ship detection. Figure 5 depicts the framework of the PFM module, which draws inspiration from CSPNet (Wang et al., Citation2020). The module comprises two main components: PBS and FasterBottleNeck. The PFM module initially splits the input into two distinct branches. One branch is processed by PBS, halving the channel size, and then passes through the FasterBottleNeck, after which the two branches are concatenated. A final PBS returns the channel size to that of the input. Inside the FasterBottleNeck, two 1 × 1 convolutions follow a PConv layer; after every convolutional layer, Batch Normalisation and the SiLU activation function are applied to improve the model's nonlinear learning capability and avoid overfitting.
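The split-process-merge flow just described can be sketched as follows. We read PBS as a PConv + BatchNorm + SiLU block, with a 1 × 1 convolution handling the channel change (since PConv preserves channel count); this wiring is our interpretation of the text and Figure 5, assuming the PConv class from Section 3.3.1's sketch.

```python
import torch
import torch.nn as nn  # PConv is assumed from the FasterNet sketch above

class PBS(nn.Module):
    """Our reading of PBS: channel mapping + PConv + BatchNorm + SiLU."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # 1x1 conv changes channels
        self.pconv = PConv(out_ch)                           # PConv keeps channel count
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pconv(self.proj(x))))

class FasterBottleNeck(nn.Module):
    """PConv followed by two 1 x 1 convs, each with BN + SiLU."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            PConv(ch),
            nn.Conv2d(ch, ch, 1, bias=False), nn.BatchNorm2d(ch), nn.SiLU(inplace=True),
            nn.Conv2d(ch, ch, 1, bias=False), nn.BatchNorm2d(ch), nn.SiLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

class PFM(nn.Module):
    def __init__(self, ch: int, n_blocks: int = 1):
        super().__init__()
        self.branch_a = PBS(ch, ch // 2)   # processed branch (halved channels)
        self.branch_b = PBS(ch, ch // 2)   # shortcut branch
        self.blocks = nn.Sequential(*(FasterBottleNeck(ch // 2) for _ in range(n_blocks)))
        self.fuse = PBS(ch, ch)            # restore the input channel size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.blocks(self.branch_a(x))
        return self.fuse(torch.cat([a, self.branch_b(x)], dim=1))
```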

Figure 5. PFM module structure diagram.


Using PConv convolution within the PFM module reduces computational complexity and memory access frequency, improving representational power and enhancing feature fusion quality. Furthermore, considering the diverse sizes and shapes of small vessels, the imbalanced distribution between large and small vessel samples, and the low contrast with the surrounding environment, the cross-stage connection structure in the PFM module effectively captures multi-scale and semantic information, thereby enhancing accuracy and robustness in target detection. Consequently, the PFM module successfully balances model computation and accuracy improvements, making it highly beneficial for ship target detection tasks.

3.4.2. Feature pyramid for multi-scale fusion

In detecting SAR vessels, the model's detection accuracy is significantly affected by numerous small vessels, imbalanced multi-scale features, and complex background interferences such as ocean waves. Considering these circumstances, this paper proposes improvements to the original YOLOX feature fusion network, PAFPN.

Shallow networks are rich in detailed features, vessels in SAR images occupy small areas, and smaller vessel targets prevail; a multi-scale feature pyramid structure can therefore fuse and exploit multi-scale information across levels. This approach aids in extracting distinguishing features between vessels and backgrounds, enhancing the model's ability to separate targets from background. Consequently, the model can locate and identify vessels more accurately, reducing both missed detections and false alarms.

Therefore, this paper augments the feature fusion network with shallow-level features, integrating contextual semantic information from four feature maps sized 144 × 144, 288 × 288, 576 × 576, and 1152 × 1152. The objective is to improve the model's multi-scale vessel detection in intricate backgrounds. The constructed PFM-based multi-scale feature fusion pyramid, illustrated in Figure 6, first captures details and features from input feature maps of different resolutions. It then upsamples the lower-resolution features and connects them with the higher-level feature maps; these merged maps are subsequently downsampled and connected with the original, non-downsampled feature maps before the output is generated.

Figure 6. Multi-scale feature fusion pyramid based on PFM module.


Our model captures information of different scales across these various levels of feature maps, facilitating a comprehensive understanding of the scene for subsequent multi-scale vessel detection.
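A compact sketch of this top-down/bottom-up fusion for one pair of adjacent pyramid levels is given below, assuming the PFM sketch above; the interpolation mode, pooling choice, and channel bookkeeping are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_two_levels(fine: torch.Tensor, coarse: torch.Tensor,
                    pfm_td: nn.Module, pfm_bu: nn.Module):
    """One top-down + bottom-up fusion step between two adjacent pyramid levels.

    pfm_td / pfm_bu are PFM-style fusion blocks sized for the concatenated
    channels; the fine level is assumed to be exactly twice the coarse resolution.
    """
    # Top-down: upsample the coarse map and concatenate with the fine map.
    up = F.interpolate(coarse, size=fine.shape[-2:], mode="nearest")
    td = pfm_td(torch.cat([fine, up], dim=1))
    # Bottom-up: downsample the merged map and concatenate with the coarse map.
    down = F.max_pool2d(td, kernel_size=2, stride=2)
    bu = pfm_bu(torch.cat([coarse, down], dim=1))
    return td, bu
```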

4. Experiment

In this section, we conduct numerous experiments to verify the efficacy of the proposed model. Section 4.1 describes the experimental environment; Section 4.2 the datasets; Section 4.3 the evaluation metrics; Section 4.4 the ablation experiments; Section 4.5 the comparisons of FastPFM with twelve object detection models, including generalisation and nearshore detection tests; and Section 4.6 the analysis of visual results.

4.1. Experimental environment

All experiments in this study were run on a server with the hardware configuration and software environment outlined in Table 1. The hyperparameter settings of the proposed FastPFM model are detailed in Table 2.

Table 1. Experimental software environment and hardware configuration.

Table 2. FastPFM hyperparameters.

4.2. Datasets

In this investigation, we used two standard datasets, HRSID (Wei et al., Citation2020) and SSDD (Zhang, Zhang, Li, et al., Citation2021) to investigate the generalizability and efficacy of the FastPFM model. These datasets encompass diverse scenarios, including coastal areas, open seas, and variations in ship sizes and sea conditions. Consequently, they offer crucial support in constructing a reliable ship detection model. During practical applications, these datasets necessitate specific considerations, such as:

  1. Complex sea waves: Due to the complexity of marine weather, various complex sea wave patterns can occur, which may cause misleading detection results by the model.

  2. Indistinct features of small ships: These datasets contain many small ship targets with indistinct features, posing a challenge to the detection performance of the model.

  3. Dense ship traffic in coastal ports: A high density of docked ships in coastal areas requires the model to have effective detection and discrimination capabilities for densely packed targets.

To address the problems above, the HRSID and SSDD datasets are crucial. These datasets include a wide variety of target samples and situations, enabling the model to be suitably trained and assessed under numerous difficult scenarios in order to determine the FastPFM model's performance and generalizability.

SSDD. The SSDD dataset, which consists of 1160 photos with resolutions ranging from 1 to 15 meters and typically at a size of 500 × 500 pixels, is an open image dataset for SAR ship detection. SSDD includes 2540 ships, with proportions of small, medium, and large ships at 60.2%, 36.8%, and 3%, respectively. It covers diverse scenes such as ports, coastal environments, and open sea surfaces, adding a certain level of challenge. The introduction of the SSDD dataset establishes a standardised and publicly available benchmark for evaluating and comparing various SAR ship detection algorithms, further advancing research and applications in the field of ship detection.

HRSID. The HRSID collection consists of 5604 SAR images of 800 × 800 pixels with resolutions between 0.5 and 3 meters. The dataset offers complex imaging scenes, including ports, docks, coastal environments, and simple sea surfaces, each containing different numbers and sizes of ship targets. A total of 16,951 ship instances are included, with small, medium, and large ships accounting for 54.5%, 43.5%, and 2% of the total. HRSID is intended to address the shortcomings of existing SAR ship datasets, notably for CNN-based ship detection. Compared to SSDD, the SAR images in HRSID capture ship features with greater precision and detail, providing a more challenging platform for training and evaluating SAR ship detection algorithms.

Furthermore, to compare ship quantities across sizes in both datasets more precisely, we present the counts for the three ship sizes as bar charts in Figure 7. In both datasets, the number of medium and small ships far exceeds that of large ships. This distribution enables the proposed model to perform well in recognising small ships and identifying multi-scale ships; however, the uneven distribution of ship sizes might impair the model's ability to recognise large ships.

Figure 7. Three-size ship count in two datasets.


4.3. Experimental evaluation metrics

This work employs the MS COCO (Lin et al., Citation2014) assessment standard to evaluate each model's effectiveness precisely. The criteria used to gauge detection efficacy are mean average precision (mAP), average recall (AR), and average precision (AP). These criteria make it feasible to evaluate the models and confirm the efficacy of the proposed FastPFM on HRSID and SSDD. The following explains the calculation and significance of these measures.

The four parameters, TP, FP, TN, and FN, are the most commonly used among these evaluation metrics. TP represents the number of correctly identified ships, FP is the number of incorrectly detected ships, TN is the number of correctly detected backgrounds, and FN is the number of missed ships.

Recall is the percentage of correctly predicted samples among all real objects, while precision is the percentage of correct samples among the objects the model predicts. They are computed as shown in Equations (9) and (10):

(9) Precision = TP / (TP + FP)

(10) Recall = TP / (TP + FN)

The formula for AP is shown in Equation (11), where P is precision and R is recall. AP is the area under the PR curve, so both the model's precision and recall enter the computation, summarising its overall performance:

(11) AP = ∫₀¹ P(R) dR
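As a worked illustration of Equations (9)-(11), the following NumPy sketch accumulates precision and recall over score-ranked detections and integrates the PR curve. The all-point interpolation with a monotone precision envelope is a common convention and an assumption here, not necessarily the exact COCO implementation.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt: int) -> float:
    """AP from score-ranked detections, following Eqs. (9)-(11)."""
    order = np.argsort(-np.asarray(scores, dtype=np.float64))  # rank by confidence
    tp = np.asarray(is_tp, dtype=np.float64)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    precision = tp_cum / (tp_cum + fp_cum)   # Eq. (9)
    recall = tp_cum / num_gt                 # Eq. (10)
    # Eq. (11): area under the PR curve, with a monotone precision envelope.
    p = np.concatenate(([0.0], precision, [0.0]))
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))
```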

4.4. Ablation experiments

We ran experiments on the benchmark HRSID dataset to thoroughly assess the efficacy of the proposed method and the contribution of each module. Specifically, we varied the combinations in which the FasterNet, PFM, and FBM modules were integrated into the baseline YOLOX model and evaluated the results, presented in Table 3. Using only FasterNet as the backbone, the model's accuracy improved by 2.7%, with a remarkable 11.4% increase in detection accuracy for large ships. Adding the PFM and FBM modules to the FasterNet-based model further increased detection accuracy by 1.7% and 1.9%, respectively, compared to the variants without them. With all of the above modules combined, the proposed model's detection accuracy outperformed YOLOX by 5.4%.

Table 3. Ablation results on HRSID for different combinations. “Baseline” indicates the YOLOX model we used, and “FasterNet,” “PFM,” and “FBM” represent the modules described above. No. 1-5 indicates the different combinations, with No. 5 being our proposed FastPFM model. “√” means the module is included. AP50 refers to the average precision at an IoU threshold of 0.5, and APS, APM, and APL are the average precisions for small, medium, and large targets, respectively. The best results are in bold.

Notably, the limited presence of large-sized ship samples in the HRSID dataset adversely affected the performance of the baseline YOLOX model, yielding a modest detection accuracy (APL) of only 4.6% for large ships. However, the proposed FastPFM method in this study significantly improved the detection accuracy for large ships, achieving a remarkable 26.9% enhancement compared to the YOLOX model. This result indicates the successful mitigation of the challenge arising from the dataset's imbalanced distribution of ship sizes, achieved by the novel approach proposed in this research.

These findings validate the fundamental contributions of each design module and show the overall validity and soundness of the suggested methodology. The experimental findings clearly demonstrate the method's capacity to handle the difficulties presented by the features of the dataset, such as the dearth of large ship samples. The usefulness of the suggested method and its capacity to achieve precision in the identification of multi-scale vessels are demonstrated by the success attained in detecting small, medium, and large vessels (6.9%, 13.6%, and 26.9% improvement over the baseline model, respectively).

Furthermore, we compare the PR (Precision-Recall) curves of FastPFM and YOLOX for clarity. As shown in Figure 8, the x-axis represents recall and the y-axis precision. The average precision (AP) corresponds to the region beneath the curve, bounded by the curve and the axes. Notably, the area enclosed by the FastPFM curve exceeds that of the YOLOX curve. These findings show that the FastPFM model surpasses the YOLOX model in overall performance.

Figure 8. Comparison of PR curves of FastPFM and YOLOX at IoU = 0.5.


4.5. Comparison experiment

To verify the resilience and detection efficacy of our algorithm, we run multiple comparative experiments. First, we compare our model with today's most advanced methods on the HRSID and SSDD datasets. Second, we validate the model's generalisation capability on the SSDD dataset and specifically analyse complex scenarios such as offshore areas and dense shipping lanes, to better understand our model's multi-scale target detection in complex backgrounds. These assessments clearly demonstrate the model's efficacy and dependability.

4.5.1. Experimental comparison of FastPFM with existing models on HRSID

To illustrate the performance of the FastPFM model, we conduct comparative experiments on the HRSID dataset against 12 other important algorithms: SSD, PVT (Wang, Xie, et al., Citation2021), FCOS (Tian et al., Citation2019), TOOD (Feng et al., Citation2021), YOLOX, YOLOF (Chen et al., Citation2021), Deformable DETR (Zhu et al., Citation2020), NAS-YOLOX (Wang et al., Citation2023), Swin-PAFF (Y. Zhang et al., Citation2023), YOLO-SD (Wang et al., Citation2022), FINet (Hu et al., Citation2022), and the algorithm proposed by Zhu et al. (Zhu et al., Citation2022). Table 4 displays the experimental results. The first seven of these techniques are SOTA methods in general target detection, and the remaining five are detection algorithms designed specifically for SAR ships.

Table 4. Results of multiple models’ detection on the HRSID dataset.

Table 4 displays each model's detection results for small, medium, and large ships, denoted APS, APM, and APL, respectively; these results illustrate the models' capacity for multi-scale ship detection. Our model reaches 92.1%, 68.6%, 69.3%, and 31.5% on AP50, APS, APM, and APL, improvements of 5.5%, 7%, 13.7%, and 26.7% over the baseline YOLOX. These results clearly demonstrate our model's capability for multi-scale ship detection. In particular, our method achieves 68.6% on small ships, an average improvement of 9.8% over the compared models. Our model leads nearly all other models in small-ship detection, trailing only FINet by 0.5%, which suggests that FINet's feature interaction module successfully extracts small-ship features; however, our model is 1.6% higher than FINet on the AP50 metric. Under the APM metric, our model achieves the best detection accuracy, 9.2% higher than the other models on average. It is also noteworthy that none of the analysed models performs well on large ships, likely because the dataset contains far fewer large ships than small and medium-sized ones, weakening the algorithms' ability to extract their features. Even so, our model outperforms the other models in large-ship detection accuracy by an average of 16.08%, demonstrating its capability on large ships.

4.5.2. Comparison of FastPFM's generalisation ability on SSDD with existing models

A model's capacity for generalisation is one of the most important criteria for assessing it. Using the SSDD SAR image dataset, which is not used for training, we compare the proposed FastPFM with other advanced detection methods to test its detection capacity more thoroughly. The results are displayed in Table 5. FastPFM improves over the baseline by 4.4%, 1.9%, 11.6%, and 5.9% on the AP50, APS, APM, and APL metrics, respectively, in this generalisation test. FastPFM improves on the baseline to different extents across all metrics, and most of its metrics exceed those of the compared models, showing that FastPFM's detection capability is nearly optimal. Regarding multi-scale ship detection, our model surpasses the other models by an average of 14.65%, 16.69%, and 3.52% on APS, APM, and APL, indicating that FastPFM's multi-scale detection advantage holds under generalisation testing as well. For large ships, however, the gain remains modest; beyond the model itself, the SSDD dataset's small size and limited representativeness are significant contributing factors. These outcomes show the FastPFM model's superior performance over alternative techniques and confirm its strength in ship target detection tasks.

Table 5. Detection results of FastPFM and other models on the SSDD dataset.

4.5.3. Ship inspection results in complex nearshore environments

Nearshore images contain many buildings and densely docked multi-scale ships, which can significantly hamper ship detection. To further test the model's ability in such complicated situations, this study compares FastPFM with ten other models on nearshore ship detection using the SSDD dataset. It should be noted that most advanced models, including ours, show lower detection accuracy in nearshore environments, given the interference from complex backgrounds such as shoreline buildings and densely docked ships. Table 6 displays the experimental findings. On the AP50 metric, the FastPFM model outperforms the baseline YOLOX by 4%; for large, medium, and small objects, detection accuracy increases by 10.9%, 9.1%, and 3%, respectively. Furthermore, on the AP50, APS, APM, and APL metrics, FastPFM's detection accuracy exceeds that of the other ten models by an average of 12.9%, 15.96%, 9.6%, and 10.45%, respectively. These findings confirm that the FastPFM model outperforms other models in nearshore ship detection and further highlight its robustness and capacity for generalisation.

Table 6. Detection results of FastPFM and other models on the SSDD nearshore dataset.

4.6. Visual results analysis

To illustrate the detection capability of the proposed model more clearly, we chose three scenarios from the HRSID dataset for investigation; the visualisations show intuitively that our model is superior to the others. The three scenarios selected for visual validation are Figure 9(a), ships near a port; Figure 9(b), small ships offshore; and Figure 9(c), dense small ships in a narrow nearshore channel against a complex background. The results demonstrate that, even against complicated backgrounds, the proposed method extracts ship characteristics with high accuracy, particularly in nearshore locations where it is difficult even for the human eye to distinguish real ships from noise. In Figure 9(a), although many buildings close to the shore resemble the ship targets, FastPFM produces no misdetections, successfully avoiding misjudging the buildings on shore, and misses none of the ships densely docked in the harbour. In contrast, other models are easily affected by shoreline buildings in this scenario, resulting in missed detections or misidentifications. In Figure 9(c), heavy background interference and densely packed small ships make the models prone to false or missed detections; whereas other methods show obvious errors, the proposed FastPFM model produces only occasional misdetections. This is attributable to the design of the FBM and PFM modules, which allow FastPFM to better identify widely dispersed small boats. These visualisation results show that, although FastPFM maintains high accuracy even when ship targets closely resemble shoreline buildings, detection errors still occur for small, densely distributed ships in nearshore environments; nevertheless, it can effectively detect multi-scale ship targets within an acceptable error range.

Figure 9. Detection results of several models on the HRSID dataset. The green boxes show the target ships' annotated positions in the dataset; the red boxes represent the predicted boxes produced by the models; the yellow boxes mark missed targets; and the blue boxes mark false detections.

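For reference, the sketch below reproduces the colour scheme of Figures 9 and 10 for an arbitrary detector's output. It is a minimal illustration, not the authors' plotting code; the (x, y, w, h) pixel box format and the 0.5 IoU matching threshold are our assumptions.

    # Minimal sketch (not the authors' code) of the colour scheme in Figures 9-10:
    # green = annotated ground truth, red = correct predictions,
    # yellow (dashed) = missed targets, blue = false detections.
    import matplotlib.pyplot as plt
    import matplotlib.patches as patches

    def iou(a, b):
        # a, b: (x, y, w, h) boxes in pixels
        iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    def draw_detections(image, gt_boxes, pred_boxes, thr=0.5):
        fig, ax = plt.subplots()
        ax.imshow(image, cmap="gray")
        matched = set()
        for p in pred_boxes:
            best_iou, best_idx = 0.0, None
            for i, g in enumerate(gt_boxes):
                if iou(p, g) > best_iou:
                    best_iou, best_idx = iou(p, g), i
            if best_iou >= thr:
                matched.add(best_idx)
                colour = "red"    # prediction matched to a ground-truth ship
            else:
                colour = "blue"   # false detection
            ax.add_patch(patches.Rectangle((p[0], p[1]), p[2], p[3],
                                           fill=False, edgecolor=colour))
        for i, g in enumerate(gt_boxes):
            ax.add_patch(patches.Rectangle((g[0], g[1]), g[2], g[3],
                                           fill=False, edgecolor="green"))
            if i not in matched:  # no prediction reached the IoU threshold
                ax.add_patch(patches.Rectangle((g[0], g[1]), g[2], g[3], fill=False,
                                               edgecolor="yellow", linestyle="--"))
        plt.show()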

Furthermore, we have chosen three images to illustrate the FastPFM model's capacity for generalisation: Figure 10(a), a nearshore environment; Figure 10(b), dense ships on the open sea; and Figure 10(c), a narrow channel. In all three environments the baseline model produces many misdetections and omissions, whereas our model identifies overlapping ships and harbour vessels better than the baseline, despite some omissions and misidentification of targets close to land.

Figure 10. Generalisation results of YOLOX and FastPFM on the SSDD dataset. The green boxes show the target ships' annotated positions in the dataset; the red boxes represent the predicted boxes produced by the models; the yellow boxes mark missed targets; and the blue boxes mark false detections.


5. Conclusion

This work presents the FastPFM ship detection model, based on YOLOX, to address false alarms and missed detections in SAR ship detection against complex backgrounds. To improve detection accuracy, the model uses FasterNet as its backbone network and adds the FBM and PFM modules to enrich semantic features and optimise feature representation. We evaluate FastPFM on the HRSID dataset against twelve other object detection models. The experimental results demonstrate FastPFM's proficiency in detecting small ships, even though samples of large ships were limited during training. Additionally, the generalisation test on the SSDD dataset reveals a significant performance improvement over the other models, and in the detection experiments targeting coastal ships the FastPFM model also performs satisfactorily.

We adopt a deep learning approach to solve several problems in ship target detection and achieve good results, reflecting the promise of deep learning for ship detection tasks. By learning from a vast quantity of sample data, deep learning can replicate key aspects of human visual cognition, autonomously extracting visual features from ship images and enabling automatic detection.

In summary, the approach presented in this research has promising applications in maritime safety monitoring, traffic management, and disaster monitoring and rescue. In maritime safety monitoring and traffic management, the algorithm can detect abnormal situations such as illegal fishing vessels and piracy, improving monitoring capability; it can also help determine the position and heading of ships and provide real-time traffic and sailing status to optimise ship traffic flow. For disaster monitoring and rescue, our method can detect vessels in distress in time to guide rescue operations and improve rescue efficiency.

Despite the FastPFM model's success in identifying multi-scale ships, including tiny ships, in complex surroundings, limitations remain: missed detections and false alarms still occur for dense small ships against complex backgrounds. To address these limitations, future research should focus on two aspects. Firstly, investigating angle-based SAR ship target recognition techniques and strengthening the network model's security, to increase detection accuracy and safety in complex circumstances (Chen et al., Citation2024; C. Chen et al., Citation2023; Han et al., Citation2020; Han et al., Citation2021; Li, Han, Weng, et al., Citation2022; Li, Han, Zheng, et al., Citation2022). Secondly, developing lightweight design approaches for the model to enable efficient deployment on mobile platforms. These further studies aim to refine the existing approach and offer a ship detection solution that is more dependable, secure, and effective.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This research is supported by the National Key Research and Development Program of China under Grant 2021YFC2801001, the Natural Science Foundation of Shanghai under Grant 21ZR1426500, and the 2022 Graduate Top Innovative Talents Training Program at Shanghai Maritime University under Grant 2022YBR005.

References

  • Cai, Z., & Vasconcelos, N. (2019). Cascade R-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5), 1483–1498. https://doi.org/10.1109/TPAMI.2019.2956516
  • Chen, C., Han, D., & Chang, C.-C. (2024). MPCCT: Multimodal vision-language learning paradigm with context-based compact transformer. Pattern Recognition, 147, 110084. https://doi.org/10.1016/j.patcog.2023.110084
  • Chen, C., Han, D., & Chang, C.-C. (2022). CAAN: Context-aware attention network for visual question answering. Pattern Recognition, 132, 108980. https://doi.org/10.1016/j.patcog.2022.108980
  • Chen, C., Han, D., & Shen, X. (2023a). CLVIN: Complete language-vision interaction network for visual question answering. Knowledge-Based Systems, 275, 110706. https://doi.org/10.1016/j.knosys.2023.110706
  • Chen, J., Kao, S.-h., He, H., Zhuo, W., Wen, S., Lee, C.-H., & Chan, S.-H. G. (2023). Run, don't walk: Chasing higher FLOPS for faster neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Chen, J., Wang, Q., Peng, W., Xu, H., Li, X., & Xu, W. (2022). Disparity-based multiscale fusion network for transportation detection. IEEE Transactions on Intelligent Transportation Systems, 23(10), 18855–18863. https://doi.org/10.1109/tits.2022.3161977
  • Chen, Q., Wang, Y., Yang, T., Zhang, X., Cheng, J., & Sun, J. (2021). You only look one-level feature. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Du, Z., Liu, R., Liu, N., & Chen, P. (2008). A new method for ship detection in SAR imagery based on combinatorial PNN model. 2008 First International Conference on Intelligent Networks and Intelligent Systems.
  • Feng, C., Zhong, Y., Gao, Y., Scott, M. R., & Huang, W. (2021). TOOD: Task-aligned one-stage object detection. 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
  • Fu, J., Sun, X., Wang, Z., & Fu, K. (2020). An anchor-free method based on feature balancing and refinement network for multiscale ship detection in SAR images. IEEE Transactions on Geoscience and Remote Sensing, 59(2), 1331–1344. https://doi.org/10.1109/TGRS.2020.3005151
  • Gao, F., He, Y., Wang, J., Hussain, A., & Zhou, H. (2020). Anchor-free convolutional network with dense attention feature aggregation for ship detection in SAR images. Remote Sensing, 12(16), 2619. https://doi.org/10.3390/rs12162619
  • Ge, Z., Liu, S., Wang, F., Li, Z., & Sun, J. (2021). YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430. https://doi.org/10.48550/arXiv.2107.08430
  • Girshick, R. (2015). Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision.
  • Han, D., Pan, N., & Li, K.-C. (2020). A traceable and revocable ciphertext-policy attribute-based encryption scheme based on privacy protection. IEEE Transactions on Dependable and Secure Computing, 19(1), 316–327. https://doi.org/10.1109/TDSC.2020.2977646
  • Han, D., Zhu, Y., Li, D., Liang, W., Souri, A., & Li, K.-C. (2021). A blockchain-based auditable access control system for private data in service-centric IoT environments. IEEE Transactions on Industrial Informatics, 18(5), 3530–3540. https://doi.org/10.1109/TII.2021.3114621
  • Hu, Q., Hu, S., Liu, S., Xu, S., & Zhang, Y. D. (2022). FINet: A feature interaction network for SAR ship object-level and pixel-level detection. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–15. https://doi.org/10.1109/TGRS.2022.3222636
  • Jiang, J., Fu, X., Qin, R., Wang, X., & Ma, Z. (2021). High-speed lightweight ship detection algorithm based on YOLO-v4 for three-channels RGB SAR image. Remote Sensing, 13(10), 1909. https://doi.org/10.3390/rs13101909
  • Leng, X., Ji, K., Yang, K., & Zou, H. (2015). A bilateral CFAR algorithm for ship detection in SAR images. IEEE Geoscience and Remote Sensing Letters, 12(7), 1536–1540. https://doi.org/10.1109/LGRS.2015.2412174
  • Li, D., Han, D., Weng, T.-H., Zheng, Z., Li, H., Liu, H., Castiglione, A., & Li, K.-C. (2022). Blockchain for federated learning toward secure distributed machine learning systems: A systemic survey. Soft Computing, 26(9), 4423–4440. https://doi.org/10.1007/s00500-021-06496-5
  • Li, D., Han, D., Zheng, Z., Weng, T.-H., Li, H., Liu, H., Castiglione, A., & Li, K.-C. (2022). MOOCschain: A blockchain-based secure storage and sharing scheme for MOOCs learning. Computer Standards & Interfaces, 81, 103597. https://doi.org/10.1016/j.csi.2021.103597
  • Li, H., Han, D., & Tang, M. (2021). A privacy-preserving storage scheme for logistics data with assistance of blockchain. IEEE Internet of Things Journal, 9(6), 4704–4720. https://doi.org/10.1109/JIOT.2021.3107846
  • Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision.
  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13.
  • Lin, Z., Ji, K., Leng, X., & Kuang, G. (2019). Squeeze and excitation rank faster R-CNN for ship detection in SAR images. IEEE Geoscience and Remote Sensing Letters, 16(5), 751–755. https://doi.org/10.1109/LGRS.2018.2882551
  • Liu, T., Yang, Z., Yang, J., & Gao, G. (2019). CFAR ship detection methods using compact polarimetric SAR in a K-Wishart distribution. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(10), 3737–3745. https://doi.org/10.1109/JSTARS.2019.2923009
  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14.
  • Migliaccio, M., Gambardella, A., & Nunziata, F. (2008, May 27–29). Ship detection over single-look complex SAR images. 2008 IEEE/OES US/EU-Baltic International Symposium.
  • Peng, H., & Tan, X. (2022). Improved YOLOX’s anchor-free SAR image ship target detection. IEEE Access, 10, 70001–70015. https://doi.org/10.1109/ACCESS.2022.3188387
  • Qian, L., Zheng, Y., Li, L., Ma, Y., Zhou, C., & Zhang, D. (2022). A new method of inland water ship trajectory prediction based on long short-term memory network optimized by genetic algorithm. Applied Sciences, 12(8), 4073.
  • Rai, M. C. E., Giraldo, J. H., Al-Saad, M., Darweech, M., & Bouwmans, T. (2022). SemiSegSAR: A semi-supervised segmentation algorithm for ship SAR images. IEEE Geoscience and Remote Sensing Letters, 19, 1–5. https://doi.org/10.1109/LGRS.2022.3185306
  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28. https://proceedings.neurips.cc/paper_files/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html
  • Steenson, B. O. (1968). Detection performance of a mean-level threshold. IEEE Transactions on Aerospace and Electronic Systems, AES-4(4), 529–534. https://doi.org/10.1109/TAES.1968.5409020
  • Sun, Z., Dai, M., Leng, X., Lei, Y., Xiong, B., Ji, K., & Kuang, G. (2021). An anchor-free detection method for ship targets in high-resolution SAR images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 7799–7816. https://doi.org/10.1109/JSTARS.2021.3099483
  • Sun, Z., Leng, X., Lei, Y., Xiong, B., Ji, K., & Kuang, G. (2021). BiFA-YOLO: A novel YOLO-based method for arbitrary-oriented ship detection in high-resolution SAR images. Remote Sensing, 13(21), 4209. https://doi.org/10.3390/rs13214209
  • Tang, G., Zhuge, Y., Claramunt, C., & Men, S. (2021). N-YOLO: A SAR ship detection using noise-classifying and complete-target extraction. Remote Sensing, 13(5), 871. https://doi.org/10.3390/rs13050871
  • Tian, Z., Shen, C., Chen, H., & He, T. (2019). FCOS: Fully convolutional one-stage object detection. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Wang, C., Shi, J., Zou, Z., Wang, W., & Zhou, Y. (2021). A semi-supervised SAR ship detection framework via label propagation and consistent augmentation. IEEE International Geoscience and Remote Sensing Symposium IGARSS, 4884–4887.
  • Wang, C.-Y., Liao, H.-Y. M., Wu, Y.-H., Chen, P.-Y., Hsieh, J.-W., & Yeh, I.-H. (2020). CSPNet: A new backbone that can enhance learning capability of CNN. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
  • Wang, H., Han, D., Cui, M., & Chen, C. (2023). NAS-YOLOX: A SAR ship detection using neural architecture search and multi-scale attention. Connection Science, 35(1), 1–32. https://doi.org/10.1080/09540091.2023.2257399
  • Wang, J., Leng, X., Sun, Z., Zhang, X., & Ji, K. (2023a). Fast and accurate refocusing for moving ships in SAR imagery based on FrFT. Remote Sensing, 15(14), 3656. https://www.mdpi.com/2072-4292/15/14/3656.
  • Wang, J., Leng, X., Sun, Z., Zhang, X., & Ji, K. (2023b). Refocusing swing ships in SAR imagery based on spatial-variant defocusing property. Remote Sensing, 15(12), 3159.
  • Wang, S., Gao, S., Zhou, L., Liu, R., Zhang, H., Liu, J., Jia, Y., & Qian, J. (2022). YOLO-SD: Small ship detection in SAR images by multi-scale convolution and feature transformer module. Remote Sensing, 14(20), 5268. https://doi.org/10.3390/rs14205268
  • Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Wei, S., Zeng, X., Qu, Q., Wang, M., Su, H., & Shi, J. (2020). HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation. IEEE Access, 8, 120234–120254. https://doi.org/10.1109/ACCESS.2020.3005861
  • Yasir, M., Shanwei, L., Mingming, X., Hui, S., Hossain, M. S., Colak, A. T. I., Wang, D., Jianhua, W., & Dang, K. B. (2023). Multi-scale ship target detection using SAR images based on improved YOLOv5. Frontiers in Marine Science, 9. https://doi.org/10.3389/fmars.2022.1086140
  • Zhai, L., Li, Y., & Su, Y. (2016). Inshore ship detection via saliency and context information in high-resolution SAR images. IEEE Geoscience and Remote Sensing Letters, 13(12), 1870–1874. https://doi.org/10.1109/LGRS.2016.2616187
  • Zhang, J., Sheng, W., Zhu, H., Guo, S., & Han, Y. (2023a). MLBR-YOLOX: An efficient SAR ship detection network with multilevel background removing modules. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 16, 5331–5343. https://doi.org/10.1109/JSTARS.2023.3280741
  • Zhang, T., Zhang, X., & Ke, X. (2021a). Quad-FPN: A novel quad feature pyramid network for SAR ship detection. Remote Sensing, 13(14), 2771. https://doi.org/10.3390/rs13142771
  • Zhang, T., Zhang, X., Li, J., Xu, X., Wang, B., Zhan, X., Xu, Y., Ke, X., Zeng, T., Su, H., Ahmad, I., Pan, D., Liu, C., Zhou, Y., Shi, J., & Wei, S. (2021). SAR ship Detection Dataset (SSDD): Official release and comprehensive data analysis. Remote Sensing, 13(18), 3690. https://doi.org/10.3390/rs13183690
  • Zhang, X., Huo, C., Xu, N., Jiang, H., Cao, Y., Ni, L., & Pan, C. (2021). Multitask learning for ship detection from synthetic aperture radar images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 8048–8062. https://doi.org/10.1109/JSTARS.2021.3102989
  • Zhang, Y., Han, D., & Chen, P. (2023b). Swin-PAFF: A SAR ship detection network with contextual cross-information fusion. Computers, Materials & Continua, 77(2), 2657–2675. https://doi.org/10.32604/cmc.2023.042311
  • Zhang, Y., & Hao, Y. (2022). A survey of SAR image target detection based on convolutional neural networks. Remote Sensing, 14(24), 6240. https://doi.org/10.3390/rs14246240
  • Zhao, W., Syafrudin, M., & Fitriyani, N. L. (2023). CRAS-YOLO: A novel multi-category vessel detection and classification model based on YOLOv5s algorithm. IEEE Access, 11, 11463–11478. https://doi.org/10.1109/ACCESS.2023.3241630
  • Zhao, Y., Zhao, L., Xiong, B., & Kuang, G. (2020). Attention receptive pyramid network for ship detection in SAR images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13, 2738–2756. https://doi.org/10.1109/JSTARS.2020.2997081
  • Zheng, Y., Li, L., Qian, L., Cheng, B., Hou, W., & Zhuang, Y. (2023). Sine-SSA-BP ship trajectory prediction based on chaotic mapping improved sparrow search algorithm. Sensors, 23(2), 704. http://doi.org/10.3390/s23020704
  • Zheng, Y., Liu, P., Qian, L., Qin, S., Liu, X., Ma, Y., & Cheng, G. (2022). Recognition and depth estimation of ships based on binocular stereo vision. Journal of Marine Science and Engineering, 10(8), 1153. https://doi.org/10.3390/jmse10081153
  • Zheng, Y., Zhang, Y., Qian, L., Zhang, X., Diao, S., Liu, X., Cao, J., & Huang, H. (2023). A lightweight ship target detection model based on improved YOLOv5s algorithm. PLoS One, 18(4), e0283932. https://doi.org/10.1371/journal.pone.0283932
  • Zhou, Y., Fu, K., Han, B., Yang, J., Pan, Z., Hu, Y., & Yin, D. (2023). D-MFPN: A doppler feature matrix fused with a multilayer feature pyramid network for SAR ship detection. Remote Sensing, 15(3), 626. http://doi.org/10.3390/rs15030626
  • Zhou, Y., Jiang, X., Chen, Z., Chen, L., & Liu, X. (2023). A semisupervised arbitrary-oriented SAR ship detection network based on interference consistency learning and pseudolabel calibration. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 16, 5893–5904. http://doi.org/10.1109/JSTARS.2023.3284667
  • Zhu, H., Xie, Y., Huang, H., Jing, C., Rong, Y., & Wang, C. (2021). DB-YOLO: A duplicate bilateral YOLO network for multi-scale ship detection in SAR images. Sensors, 21(23), 8146. https://doi.org/10.3390/s21238146
  • Zhu, L., Wang, X., Ke, Z., Zhang, W., & Lau, R. W. (2023). BiFormer: Vision Transformer with bi-level routing attention. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Zhu, M., Hu, G., Zhou, H., & Wang, S. (2022). Multiscale ship detection method in SAR images based on information compensation and feature enhancement. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–13. https://doi.org/10.1109/tgrs.2022.3202495
  • Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2020). Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159. https://doi.org/10.48550/arXiv.2010.04159
  • Zong, C., & Wan, Z. (2022). Container ship cell guide accuracy check technology based on improved 3D point cloud instance segmentation. Brodogradnja: Teorija i praksa brodogradnje i pomorske tehnike, 73(1), 23–35.