Research Article

Night construction site detection based on ghost-YOLOX

Article: 2316015 | Received 17 Jul 2023, Accepted 02 Feb 2024, Published online: 14 Feb 2024

Abstract

Existing target detection algorithms, when applied to night-time site monitoring, are limited by site hardware conditions, lighting conditions, and the fuzzy, small targets to be detected, which makes it difficult to achieve good detection results. In this paper, we propose the Ghost-YOLOX algorithm for night-time site monitoring, using YOLOX-X as the baseline model. To reduce the number of model parameters and improve detection speed, the algorithm builds a Sim-Ghost residual module from Ghost convolution and the SimAM self-attention module according to the gradient path design strategy, and uses it to reconstruct the backbone and neck networks. To improve the network's detection of fuzzy and small targets, a Cross Involution Attention (CIA) module is constructed from Involution using double cross convolution and added to the neck network, enabling the network to obtain more efficient channel and spatial attention. The experimental results show that, compared with the original YOLOX-X algorithm, Ghost-YOLOX reduces the number of parameters to 16.7% of the original model and increases detection speed by 2.3 times, while the average accuracy of the model improves by 2.55%.

1. Introduction

With the continuous development of target detection technology, manual supervision of construction-site safety has gradually been replaced by intelligent computer supervision, which uses advanced video surveillance equipment and image processing technology to realise real-time monitoring and management of hoisting work surfaces on construction sites. Daytime monitoring of hoisting work surfaces has made great progress in construction-site target detection, such as monitoring workers' safety helmets and the safe use of tower crane equipment. However, night-time monitoring of hoisting surfaces is limited by many factors, such as lighting conditions and the cost of monitoring equipment, and its accuracy and detection speed lag far behind daytime hoisting-surface monitoring.

From 2017 to 2021, a total of 3,622 production safety accidents occurred in housing and municipal projects in China, resulting in 4,198 deaths. Among the accidents causing casualties, 53% were caused by falling heavy objects and 12% by mechanical injuries; these accidents occurred because work started without knowing whether construction workers were present on site (Tang, Chen, Zhang, & Zhang, Citation2020). Supervision at night construction sites relies mostly on manual inspection, which is time-consuming and cumbersome; in particular, the poor lighting at night makes it easy for supervisors to misjudge due to visual fatigue, so manual supervision cannot meet the needs of high-risk construction environments. Frequent safety accidents on construction sites not only cause casualties among construction workers but also seriously affect the happiness of workers' families and bring adverse effects to society (Wang, Wang, Wu, & Yang, Citation2021). Effectively identifying construction workers in the complex night construction site environment is an important basis for ensuring worker safety. It is therefore of great significance to study a night-time construction site monitoring algorithm with strong robustness and high accuracy.

At present, research on night target detection has made certain progress. Gong et al. (Citation2023) used an improved proximity-sensing pedestrian detection algorithm to detect pedestrians at night. Yang et al. (Citation2023) combined an improved YOLOv5 with AE automatic exposure to enable inspection robots to operate at night. Zhang, Gao, Zhao, and Hou (Citation2023) used an improved YOLOv5s with Retinex data enhancement for night-time vehicle detection. However, when these methods are applied to night-time construction site inspection, good detection results are not obtained because of the complex background environment of the construction site.

Existing general night target detection methods often combine a target detection algorithm with an image enhancement algorithm: detection accuracy is improved by enhancing image features, but this adds considerable computational cost to the network. To collect images with clearer features under the poor lighting conditions at night, the monitoring equipment instead captures grayscale images, which can retain higher resolution than RGB images under poor lighting and image blur (Shivanthan, Andy, Dyer, & Song, Citation2018).

Even with grayscale images improving the clarity of night-time monitoring footage, several challenging issues remain in night-time construction site monitoring: (1) The images used for night-time detection are low-quality grayscale images, so the network needs sufficient depth and width to extract rich feature information from them; however, a network model that is too deep and too wide cannot meet the various deployment requirements of construction sites, so it is necessary to reduce the number of model parameters and improve detection efficiency without reducing the depth and width of the network. (2) Because a construction site covers a wide area, the camera needs a wide viewing angle far from the ground, and safety helmets are themselves small; most of the targets to be detected are therefore small targets, which increases the difficulty of network identification. (3) Since the camera is mounted at the end of the boom of a moving tower crane, and workers on the site walk back and forth, the targets to be detected are often fuzzy targets, which reduces the detection accuracy of the network.

To solve the above problems, this paper takes the high-performing YOLOX (Ge et al., Citation2019) target detection algorithm as the benchmark for improvement. Compared with other traditional detection algorithms, YOLOX has a deeper backbone network and can achieve better accuracy. The improvements target its large number of network parameters and its shortcomings in detecting fuzzy and small targets: based on Ghost convolution and SimAM, a new Sim-Ghost residual module is designed according to the gradient path design strategy and used to reconstruct the backbone and neck networks, reducing the number of model parameters and improving detection speed; based on Involution, a new attention module, the CIA module, is designed using double cross convolution and added to the neck network to improve the network's detection of fuzzy and small targets.

2. Related work

The mainstream target detection technology applied to construction site monitoring falls into two types. The first is the two-stage target detection algorithm, such as Faster RCNN (Ren, He, Girshick, & Sun, Citation2017): its first stage generates suitable candidate boxes, and its second stage performs classification and regression on the extracted features. However, because of this staged operation, its detection speed and parameter count struggle to meet the needs of field applications. Construction site monitoring therefore more often uses single-stage target detection algorithms, such as the YOLO series (Redmon, Divvala, Girshick, & Farhadi, Citation2016; Redmon & Farhadi, Citation2017; Redmon & Farhadi, Citation2018; Bochkovskiy, Wang, & Liao, Citation2020; Glenn, Citation2020; Iandola et al., Citation2022; Wang et al., Citation2022; Hussain, Citation2023), which complete feature extraction, classification, and regression in a single stage; their excellent detection speed and accuracy mean they are often used directly for on-site detection tasks.

In order to extract effective image features from complex night-time images, a backbone feature extraction network with sufficient depth and width is required; however, this significantly increases the computing cost of the network and cannot meet the needs of edge devices deployed on construction sites. Solving this requires lightweight improvements to the network. Common lightweight networks include the MobileNets and GhostNet. MobileNetv1 reduces the computational cost of the network by introducing depthwise separable convolution, but this also causes a large loss of feature map information (Howard et al., Citation2017). To address this information loss, MobileNetv2 introduces a linear bottleneck and an inverted residual structure to reduce the damage of the ReLU activation function to nonlinear features, thereby reducing information loss in the residual process (Sandler et al., Citation2018). MobileNetv3 introduces the SE module into the network, combines the inverted residual structure with the H-swish activation function, and selectively extracts feature information by building channel attention, thus reducing the computational workload of the model. GhostNet proposes the Ghost module, which replaces some nonlinear convolution operations with cheap operations and obtains a large number of feature maps with a small number of parameters, effectively saving network computing cost (Han et al., Citation2020). Building on existing lightweight networks, Chen et al. (Citation2023a) used the lightweight PP-LCNet to reconstruct the backbone network of the YOLOv4 algorithm and used depthwise separable convolution to reduce model parameters, but this also reduced the feature extraction capability of the backbone network. Chen et al. (Citation2023b) proposed an improved YOLOv5 model that introduces the Ghost module to optimise the backbone and neck network structures and reduce the number of parameters, and uses a Bi-directional Feature Pyramid Network (Bi-FPN) to reconstruct the neck network for feature fusion.

The second difficulty in night-time construction site monitoring is that the targets to be detected are mostly small or fuzzy. Small target detection requires the network to fuse feature information from different feature layers. Fang et al. (Citation2021) added a small feature extractor and a reshaping pass-through layer to the YOLO network, together with bypass and cascade skip connections, combining low-level location information with more meaningful high-level information. Liu (Citation2021) enhanced the bottom-to-top information transmission path by downsampling the feature pyramid in the YOLOv4 network and then fusing feature maps of different layers to improve small target detection. Wang et al. (Citation2023a) used an adaptive attention module (AAM) and a feature enhancement module (FEM) to improve the feature pyramid in the YOLOv5 neck network; the improved feature pyramid, AF-FPN, enhances the network's multi-scale target detection by reducing information loss during feature map generation. Fuzzy target detection requires the network to learn feature information that can effectively describe the target when fine-grained information in the image is partially lost. Cui et al. (Citation2022) proposed a general framework, RestoreDet, which uses downsampling degradation as a self-supervised signal to explore the intrinsic visual structure of blurred targets under various resolutions and other degradation conditions; the framework can be used with any mainstream target detection network but adds considerable computational cost. Sunkara and Luo (Citation2023) proposed an improved YOLOv5 model that replaces the strided convolution and pooling layers in the network with SPD-Conv, reducing the loss of fine-grained information and improving the efficiency of feature learning, thereby improving detection of fuzzy targets. Liang and Li (Citation2021) used a generative-adversarial-network-based blind motion deblurring method to preprocess images, and at the same time cropped and compressed the YOLOv3 network to optimise detection of blurred targets.

At present, night-time construction site monitoring based on the industrial mainstream YOLO series has also made certain progress. Some existing research first enhances the quality of night images and then sends them to the network for detection: Shi, Xie, Zhang, Jiang, and Chen (Citation2023) used the Retinex visual model to enhance low-light image quality and then detected the enhanced images with the YOLOv3 framework, while Wang et al. (Citation2023b) used the CLAHE algorithm to enhance low-light image quality and then detected with an improved YOLOX framework that adds a decoupled head and a ConvNet module. Other studies improve the network's feature extraction by adding targeted modules: Minsoo et al. (Citation2023), based on the YOLOv5 algorithm, added a weighted triple attention mechanism and Softpool spatial pyramid pooling fast (Softpool-SPPF) to extract and retain more spatial information, thereby improving small-target detection accuracy; Shu, Zhang, Song, Wu, and Yuan (Citation2023), also based on YOLOv5, enhance the low-level features of the image by introducing a feature enhancement module and a channel attention mechanism that suppresses noise features during feature extraction, and then introduce a feature positioning module into the neck network to strengthen the network's ability to locate targets.

To improve detection accuracy on low-quality images, most existing night-time construction site detection technologies use image enhancement algorithms to enhance image features, and a few add multiple modules to the network to improve its feature extraction capability. However, both approaches add model parameters and computation time, making it difficult to strike a good balance between detection accuracy and efficiency. It is therefore of great significance to study a night-time construction site monitoring algorithm with low computational cost, strong robustness, and high accuracy.

3. YOLOX network structure

The YOLO series of algorithms has gradually become the preferred framework for most industrial applications because of its better overall performance. At present, YOLOv5, YOLOv6, YOLOv7, and YOLOX are the most representative. Since the feature information extracted from night detection images is more complex, YOLOv6 and YOLOX, which are anchor-free and use two-branch decoupled heads, are better suited to balancing detection accuracy and efficiency for night construction site monitoring. However, because of the large number of re-parameterised blocks in its network structure, YOLOv6 is more difficult to quantise for industrial deployment in combination with self-built datasets. The YOLOX algorithm was therefore selected as the benchmark algorithm for improvement.

The YOLOX object detection algorithm, released in 2021, is a deep neural network detection model based on the PyTorch framework. YOLOX comes in two families: lightweight network structures (YOLOX-Nano and YOLOX-Tiny) and standard network structures (YOLOX-S, YOLOX-M, YOLOX-L, and YOLOX-X), where the depth and width of the four standard models increase in sequence. Because feature extraction from the blurred grayscale images used in night-time construction site monitoring is difficult, this paper uses the YOLOX-X model as the benchmark model and improves upon it.

The YOLOX model contains three output layers of different sizes, corresponding to three decoupled heads. Its overall network model is divided into four parts: input, backbone network, neck network, and output. The network structure is shown in Figure 1.

Figure 1. YOLOX network structure.

3.1. Input

The input of YOLOX mainly includes three parts: Mosaic data enhancement, Mixup data enhancement, and Focus structure.

Mosaic data enhancement splices images via random scaling, random cropping, and random arrangement, which can effectively improve detection of small targets. Mixup data enhancement is an additional strategy applied on top of Mosaic: two images are fused with a custom weighting coefficient, which improves the generalisation and robustness of the network.
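
As a rough illustration of the Mixup step, the sketch below blends two images with a randomly drawn fusion coefficient. The Beta-distribution parameter and the convention of keeping both images' boxes are illustrative assumptions, not YOLOX's exact settings.

```python
import numpy as np

def mixup(img_a, img_b, boxes_a, boxes_b, alpha=1.5):
    """Minimal Mixup sketch: blend two images with a random weight."""
    lam = np.random.beta(alpha, alpha)  # fusion coefficient in (0, 1)
    mixed = lam * img_a.astype(np.float32) + (1 - lam) * img_b.astype(np.float32)
    # For detection, boxes from both images are usually kept rather than weighted.
    mixed_boxes = np.concatenate([boxes_a, boxes_b], axis=0)
    return mixed.astype(np.uint8), mixed_boxes
```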

The Focus structure is used to increase channel information. After the image enters the Focus structure, it is sliced: a value is taken from every other pixel, so that one image is divided into four images. This slicing expands the input channels fourfold without losing information and at the same time improves the training speed of the network.
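
A minimal sketch of the slicing operation, assuming the standard (B, C, H, W) tensor layout:

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    """Focus slicing: (B, C, H, W) -> (B, 4C, H/2, W/2), no information lost."""
    return torch.cat(
        [x[..., ::2, ::2],     # pixels at (even row, even column)
         x[..., 1::2, ::2],    # (odd row, even column)
         x[..., ::2, 1::2],    # (even row, odd column)
         x[..., 1::2, 1::2]],  # (odd row, odd column)
        dim=1)

x = torch.randn(1, 3, 640, 640)
print(focus_slice(x).shape)  # torch.Size([1, 12, 320, 320])
```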

3.2. Backbone network

The backbone network of YOLOX adopts the Darknet53 structure; a large number of residual structures are introduced so that, while the depth of the network increases, vanishing and exploding gradients are avoided.

At the same time, the backbone network of YOLOX replaces the ReLU function used in previous YOLO versions with the SiLU function, defined as SiLU(x) = x · sigmoid(x). Compared with ReLU, the SiLU curve is smoother near zero, and because the sigmoid factor is bounded between 0 and 1, small negative inputs are attenuated rather than cut off entirely. During network training, the SiLU function can therefore retain more input information than ReLU.
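
A small worked comparison of the definition SiLU(x) = x · sigmoid(x) against ReLU; the sample points are arbitrary:

```python
import torch

x = torch.linspace(-4.0, 4.0, 9)
silu = x * torch.sigmoid(x)  # smooth, slightly negative for small negative x
relu = torch.relu(x)         # hard cut-off at zero
# SiLU lets small negative inputs pass attenuated values (and gradients),
# which is the sense in which it "retains more input information".
print(silu)
print(relu)
```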

3.3. Neck network

YOLOX's neck network adopts the FPN structure, which can obtain feature maps of various scales and integrate deep features and shallow features to meet the different needs of target classification and positioning in the detection process.

The FPN structure can select the size of the output feature layer according to the different sizes of the target to be detected, without going through all layers to output the final feature map; the detection performance of the network can be guaranteed while improving the detection speed of the network.

3.4. Output

The output of YOLOX adopts the decoupled head structure and the SimOTA label-matching strategy.

The decoupled head structure uses two separate branches for classification and localisation, avoiding interference between the different feature information required by the two tasks. The decoupled head effectively improves the convergence speed and accuracy of the network but increases computational complexity.

The SimOTA label-matching strategy divides anchor boxes into positive and negative samples by calculating a cost function and can filter out more high-quality positive sample anchor boxes to match the target, preventing positive samples from being discarded when the target is occluded and thereby improving the generalisation ability of the model.

4. Ghost-YOLOX network structure

The improved Ghost-YOLOX network structure is shown in Figure 2. To reduce the number of model parameters and improve detection speed, a Sim-Ghost residual module with fewer parameters is designed based on the gradient path design strategy and Ghost convolution, and is used to reconstruct the backbone and neck networks of YOLOX. To improve the network's detection of fuzzy and small targets, a CIA module is constructed based on the double cross Involution operation, enabling the network to obtain more efficient channel and spatial attention.

Figure 2. Ghost-YOLOX.

4.1. Sim-Ghost residual module

4.1.1. Ghost convolution

The YOLOX backbone network adopts the Darknet53 structure, which contains a large number of residual structures. The basic unit of these residual structures is the nonlinear convolution operation, that is, a combination of convolution, batch normalisation, and nonlinear activation. The nonlinear convolution operation outputs a rich set of feature maps containing the various features extracted from the image, but many of the output feature maps carry similar feature information. For the grayscale, blurred images recorded at night construction sites, the loss of colour information means the feature maps output by nonlinear convolution contain even more repeated feature information, which leads to network redundancy and unnecessarily increases the number of parameters.

The feature maps extracted from a night construction site image by the nonlinear convolution operation are shown in Figure 3. Feature maps in boxes of the same colour contain very similar feature information; such maps can be obtained through simple linear transformations, without complex nonlinear operations that increase the number of network parameters.

Figure 3. Feature map redundancy.

Solving the feature map redundancy problem described above can significantly reduce the network's computational cost. For this purpose, a Ghost residual module built from Ghost convolution (Han et al., Citation2020) replaces the residual module based on nonlinear convolution in the backbone network. Figure 4 compares the feature extraction processes of the Ghost module and an ordinary nonlinear convolution. The Ghost module first performs an ordinary nonlinear convolution to output a small number of intrinsic feature maps; it then applies linear transformations to the intrinsic feature maps to obtain more ghost feature maps, and finally splices the intrinsic and ghost feature maps together as the output.

Figure 4. Feature extraction process comparison. (a) Nonlinear convolution operation; (b) Ghost module convolution operation.

The ghost feature maps are generated from the intrinsic feature maps in the Ghost module as shown in Equation (1):

$$\gamma_{i,j} = \Phi_{i,j}(y_i), \quad i = 1, \ldots, m, \; j = 1, \ldots, s \tag{1}$$

In the equation, $y_i$ is the $i$-th intrinsic feature map and $\Phi_{i,j}$ is the $j$-th operation applied to it. That is, $s$ operations are performed on each intrinsic feature map: one identity operation followed by $s - 1$ linear transformations. The number of feature maps $n$ finally generated is given by Equation (2):

$$n = (s-1) \times m + m = m \times s \tag{2}$$

The Ghost module reduces computational cost by generating the repeated feature maps through linear operations. Its speed-up ratio in time complexity over the nonlinear convolution operation is given by Equation (3):

$$r_s = \frac{n \cdot h \cdot w \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h \cdot w \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h \cdot w \cdot d \cdot d} = \frac{c \cdot k \cdot k}{\frac{1}{s} \cdot c \cdot k \cdot k + \frac{s-1}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s \tag{3}$$

The compression ratio of the Ghost module over the nonlinear convolution operation in space complexity is given by Equation (4):

$$r_c = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot d \cdot d} = \frac{s \cdot c}{s + c - 1} \approx s \tag{4}$$

In the equations, $c$ is the number of input channels, $n$ is the number of output channels, $h$ and $w$ are the height and width of the output feature map, $k \times k$ is the kernel size of the nonlinear convolution, $d \times d$ is the kernel size of a linear transformation, and $s \ll c$.
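
As a concrete illustration, the sketch below follows the common GhostNet-style implementation in which the linear transformation $\Phi$ is realised as a cheap depthwise convolution. The kernel sizes, the ratio s, and the activation are illustrative assumptions, not necessarily the exact configuration used in this paper.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost convolution sketch: n/s intrinsic maps from a 1x1 primary
    convolution, the remaining ghost maps from a depthwise (cheap) one."""
    def __init__(self, c_in, c_out, s=2, d=3):
        super().__init__()
        c_intr = c_out // s
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_intr, 1, bias=False),
            nn.BatchNorm2d(c_intr), nn.SiLU())
        self.cheap = nn.Sequential(  # depthwise conv as the linear transform
            nn.Conv2d(c_intr, c_intr * (s - 1), d, padding=d // 2,
                      groups=c_intr, bias=False),
            nn.BatchNorm2d(c_intr * (s - 1)), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)                          # intrinsic feature maps
        return torch.cat([y, self.cheap(y)], dim=1)  # splice with ghost maps

m = GhostModule(64, 128)
print(m(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 80, 80])
```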

4.1.2. Ghost residual module

To effectively improve the feature extraction, feature selection, and feature fusion capabilities of the network, residual modules are designed according to the gradient path design strategy, as shown in Figure 5. In the figure, Ghost1 is the residual module for the backbone network and Ghost2 is the residual module for the neck network. In the original YOLOX network, CSPNet increases model depth by stacking residual structures, but stacking can only lengthen the longest gradient path. By redesigning the structure of the main feature extraction channel, the Ghost residual module contains four different branches, which effectively increases both the shortest and the longest gradient paths of the network, enables the network to learn more features, and improves the robustness of the model.

Figure 5. Ghost residual module structure. (a) Backbone residual structure; (b) Neck residual structure.

In the Ghost residual module, although Ghost convolution uses linear transformations to simulate the generation of redundant features, there remain subtle differences between these generated features and the redundant features produced by nonlinear convolution. Within a single residual module this nuance has little impact on detection, but if all nonlinear convolution operations in the backbone and neck networks were replaced by Ghost convolution, it would have a greater impact on the network's detection results. To compensate for this defect of Ghost convolution, ordinary convolution and Ghost convolution are used in combination when constructing the residual module, as shown in Figure 5: the nonlinear convolution operation is used at the beginning of the main feature extraction channel, in the residual channel, and at the end of the module, so that the feature information it generates infiltrates the information output by Ghost convolution (Li et al., Citation2022). This reduces the negative impact of Ghost convolution's defects on the model while effectively exploiting its lightweight advantages.

4.1.3. Sim-Ghost residual module

Replacing the residual modules based on nonlinear convolution in the backbone network with Ghost residual modules greatly reduces the number of parameters and improves the detection speed of the model. However, because the grayscale, blurred images used for night construction site detection contain complex feature information, the importance of the information in the feature maps obtained through linear transformation varies greatly. Introducing the SimAM self-attention module (Yang et al., Citation2021) into the Ghost residual module to infer 3D attention weights for the feature information in each layer's feature maps can therefore effectively improve the expressive ability of the Ghost residual module.

The process by which the SimAM module infers 3D weights is shown in Figure 6. By jointly considering 1D channel attention and 2D spatial attention, it evaluates the importance of the different neurons in each network layer and, based on a custom energy function, assigns higher importance to neurons with stronger spatial effects. The energy function is defined in Equation (5):

$$e_t(w_t, b_t, \mathbf{y}, x_i) = (y_t - \hat{t})^2 + \frac{1}{M-1}\sum_{i=1}^{M-1}(y_o - \hat{x}_i)^2 \tag{5}$$

In the equation, $\hat{t} = w_t t + b_t$, $\hat{x}_i = w_t x_i + b_t$, and $M = H \times W$. To simplify the formula, binary labels ($y_t = 1$, $y_o = -1$) are used and a regularisation term is added, giving the final energy function in Equation (6):

$$e_t(w_t, b_t, \mathbf{y}, x_i) = \frac{1}{M-1}\sum_{i=1}^{M-1}\bigl(-1 - (w_t x_i + b_t)\bigr)^2 + \bigl(1 - (w_t t + b_t)\bigr)^2 + \lambda w_t^2 \tag{6}$$

Figure 6. SimAM module.

Minimising $e_t$ is equivalent to training the linear separability of different neurons within the same layer, and the formula has the analytical solution given in Equation (7):

$$w_t = -\frac{2(t - \mu_t)}{(t - \mu_t)^2 + 2\sigma_t^2 + 2\lambda}, \qquad b_t = -\frac{1}{2}(t + \mu_t)\,w_t \tag{7}$$

In the equation, $\mu_t = \frac{1}{M-1}\sum_{i=1}^{M-1} x_i$ and $\sigma_t^2 = \frac{1}{M-1}\sum_{i=1}^{M-1}(x_i - \mu_t)^2$. The minimum energy is obtained from Equation (8):

$$e_t^{*} = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda} \tag{8}$$

Since the energy function has an analytical solution, the optimisation requires only a few computations to complete.
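
A minimal, parameter-free sketch of this closed-form computation, following the public SimAM formulation (the value of λ is illustrative):

```python
import torch

def simam(x: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Weight each neuron by the inverse of its minimal energy e_t*."""
    n = x.shape[2] * x.shape[3] - 1                    # M - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)  # (t - mu)^2 per neuron
    v = d.sum(dim=(2, 3), keepdim=True) / n            # channel-wise variance
    e_inv = d / (4 * (v + lam)) + 0.5                  # proportional to 1/e_t*
    return x * torch.sigmoid(e_inv)                    # 3D attention gating

print(simam(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```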

The improved Sim-Ghost residual structure is shown in Figure 7. The backbone and neck networks reconstructed with the Sim-Ghost residual module are named GhostNet. While GhostNet greatly reduces the number of model parameters and improves detection speed, the network's ability to extract and express feature information also improves. Table 1 compares GhostNet with the four standard-level YOLOX network structures.

Figure 7. Sim-Ghost residual module. (a) Backbone residual structure; (b) Neck residual structure.

Table 1. Comparison results.

4.2. CIA module

When processing the blurred grayscale images used in night construction site monitoring, the YOLOX algorithm easily loses the feature information contained in fuzzy targets. Moreover, most of the helmets to be detected are small targets, and detecting small targets requires the network to obtain more global information. Adding an attention mechanism tells the network which feature and location information of the fuzzy target in the feature map deserves attention, while also providing effective global information that supplies the long-range dependencies essential for vision tasks across the entire model.

Among traditional attention mechanisms, the SE module (Hu, Shen, Albanie, Sun, & Wu, Citation2017) computes channel attention via 2D global pooling but ignores the importance of location information in capturing target features. The CBAM module (Woo et al., Citation2018) uses convolution to compute spatial attention and exploit position information, and reduces the input channel dimension to save parameters, but it has limitations in handling global context dependencies and channel-spatial relationships. The CA module (Hou et al., Citation2021) adds coordinate attention to the feature map by introducing position information into channel attention, but its computational cost is high, and attention weights must be computed over the entire feature map, resulting in an inability to capture long-distance dependencies.

To enable the network to simultaneously acquire efficient channel attention, spatial attention, and global context information from feature images, thereby improving detection of fuzzy and small targets, this paper designs the CIA module based on the double cross Involution convolution; its structure is shown in Figure 8.

Figure 8. CIA module structure.

4.2.1. Involution

Traditional convolution has two properties: spatial invariance and channel specificity. Spatial invariance means the convolution kernel shares parameters across all spatial positions; channel specificity means the output feature information aggregates information from all channels of the input features, with parameters not shared across channels. When building an attention module that must obtain channel and spatial attention, however, these properties become defects: spatial invariance limits the ability to model different local spatial positions in the feature map, and channel specificity produces too much redundant channel information.

Given these defects of traditional convolution for building attention modules, the attention module here is built on Involution convolution, which has the opposite properties: spatial specificity and channel sharing. Spatial specificity means the convolution kernels differ at different spatial positions of the same feature map; channel sharing means each group of input features shares the same convolution kernel. The spatial specificity of Involution lets the attention module capture long-distance spatial relationships more effectively when constructing spatial attention, while channel sharing helps reduce redundant channel information, yielding more concise and efficient channel attention.

The Involution structure is shown in Figure 9. First, the $K \times K \times 1$ Involution kernel is multiplied point by point with the pixels in the $K \times K$ neighbourhood of the selected pixel in the input feature map, repeated over each of the $C$ channels to obtain a $K \times K \times C$ three-dimensional matrix. The width and height dimensions of this matrix are then summed, retaining only the channel dimension, to obtain a $1 \times 1 \times C$ vector in the output feature map; this vector holds the value of each of the $C$ channels at that pixel position. The Involution kernel is obtained as shown in Equation (9):

$$\mathcal{H}_{i,j} = \Phi(X_{i,j}) = W_1\,\sigma(W_0 X_{i,j}), \qquad W_0 \in \mathbb{R}^{\frac{C}{r} \times C}, \quad W_1 \in \mathbb{R}^{(K \times K \times G) \times \frac{C}{r}} \tag{9}$$

In the formula, $W_0$ and $W_1$ are two linear transformations that control the channel dimension during the convolution process, and $\sigma$ denotes batch normalisation followed by a nonlinear activation. Once the Involution kernel is obtained, the convolution proceeds as in Equation (10):

$$Y_{i,j,k} = \sum_{(u,v)\,\in\,\Delta_K} \mathcal{H}_{i,j,\,u+\lfloor K/2 \rfloor,\,v+\lfloor K/2 \rfloor,\,\lceil kG/C \rceil}\; X_{i+u,\,j+v,\,k} \tag{10}$$

In the formula, $\Delta_K$ is the neighbourhood of pixel $X_{i,j}$.
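
The stride-1 sketch below shows this per-pixel kernel generation and multiply-accumulate; the group count, kernel size, and reduction ratio are illustrative defaults rather than the exact settings used in the CIA module.

```python
import torch
import torch.nn as nn

class Involution(nn.Module):
    """Involution sketch: a K x K kernel generated per pixel from the input
    itself (Equation 9) and shared across each of G channel groups."""
    def __init__(self, channels, k=7, groups=16, reduction=4):
        super().__init__()
        self.k, self.groups = k, groups
        self.reduce = nn.Sequential(                   # W0 followed by sigma
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction), nn.ReLU())
        self.span = nn.Conv2d(channels // reduction, k * k * groups, 1)  # W1
        self.unfold = nn.Unfold(k, padding=k // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        kernel = self.span(self.reduce(x))             # (B, K*K*G, H, W)
        kernel = kernel.view(b, self.groups, 1, self.k * self.k, h, w)
        patches = self.unfold(x).view(
            b, self.groups, c // self.groups, self.k * self.k, h, w)
        # multiply-accumulate over the K x K neighbourhood (Equation 10)
        return (kernel * patches).sum(dim=3).view(b, c, h, w)

print(Involution(64)(torch.randn(1, 64, 40, 40)).shape)
```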

Figure 9. Involution structure.

4.2.2. Double cross convolution

When the attention module derives spatial attention, efficiently obtaining the context information of the whole image makes it possible to build whole-image dependency relationships. Therefore, based on the spatially specific Involution convolution, a more efficient attention module is constructed using double cross convolution.

The double cross Involution process is shown in Figure 10. A pixel in the input feature map first undergoes a cross Involution operation, i.e. the horizontal and vertical Involution convolutions in Figure 10, after which the pixel has gathered its horizontal and vertical context information; after the second cross Involution operation, the pixel has gathered information from every pixel of the image. The specific process by which crossed Involutions acquire global information is shown in Figure 11.

Figure 10. Double cross Involution.

Figure 11. Global information acquisition method.

In STEP 1, $(a_x, a_y)$ captures all feature dependencies on its cross paths. At the same time, $(a_x, b_y)$ obtains the horizontal feature dependency of $(b_x, b_y)$, and $(b_x, a_y)$ obtains its vertical feature dependency. In STEP 2, when $(a_x, a_y)$ again captures the feature dependencies on its cross paths, it indirectly obtains the feature dependencies of $(b_x, b_y)$ from $(a_x, b_y)$ and $(b_x, a_y)$. Global contextual feature information can therefore be acquired with the double cross Involution operation.
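
The toy sketch below illustrates this two-step propagation, using a per-row and per-column maximum as a stand-in for the horizontal and vertical Involution convolutions: after one cross pass, only the cross through the source pixel has received its information; after the second pass, every pixel has.

```python
import torch

def cross_pass(x: torch.Tensor) -> torch.Tensor:
    """Each pixel gathers information from its own row and column."""
    row = x.max(dim=1, keepdim=True).values.expand_as(x)  # along the width
    col = x.max(dim=0, keepdim=True).values.expand_as(x)  # along the height
    return torch.maximum(row, col)

x = torch.zeros(5, 5)
x[1, 1] = 1.0                  # a single source pixel (a_x, a_y)
step1 = cross_pass(x)          # STEP 1: only the cross through (1, 1) is reached
step2 = cross_pass(step1)      # STEP 2: the whole map is reached
print(step1.sum().item(), step2.sum().item())  # 9.0 25.0
```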

4.2.3. CIA module

The CIA module obtains efficient channel attention through the channel-sharing property of Involution; it captures long-distance spatial relationships through Involution's spatial specificity and obtains global context information through the double cross Involution operation, yielding efficient spatial attention. For GhostNet, the backbone and neck network reconstructed with the Sim-Ghost residual module in the previous section, the embedding position of the CIA module was studied, and three improved GhostNet networks with different embedding positions are proposed.

Embedding the CIA module in the backbone forms GhostNet-B (GhostNet-Backbone), whose structure is shown in Figure 12. Through the CIA module, channel and spatial attention parameters are injected into the feature maps during the backbone's feature extraction, making it easier to locate fine-grained feature information in the image and strengthening the network's extraction of the principal features used to distinguish targets.

Figure 12. Backbone network structure embedded with CIA module.

Embedding the CIA module in the neck forms GhostNet-N (GhostNet-Neck), whose structure is shown in Figure 13. Through the CIA module, channel and spatial attention parameters are injected into the feature maps during the neck's feature fusion and transfer, so that the feature maps preferentially transfer and accept the features of the targets to be detected.

Figure 13. Neck network structure embedded with CIA module.

Embedding the CIA module in the head forms GhostNet-H (GhostNet-Head), whose structure is shown in Figure 14. Through the CIA module, dual attention weighting over channel and space is applied to the high-dimensional feature maps after whole-image contextual semantic information has been fused; this extracts fine-grained semantic information helpful for detection from the high-dimensional features more efficiently and helps the network focus on target features.

Figure 14. Head network structure embedded with CIA module.

The three embedding positions were compared experimentally; the results are shown in Table 2. The results show that embedding the CIA module into the neck most effectively biases the network's transfer and fusion of target features, maximising its ability to detect fuzzy and small targets.

Table 2. Comparison of experimental results of different embedding methods.

The CIA module first receives and processes the index positions of all feature information in the input feature map, and obtains the attention distribution by computing the probability vector of the correlation between each input and the target. For $N$ items of input feature information, with target vector $q$, the attention distribution $\alpha_i$ is given by Equation (11):

$$\alpha_i = \mathrm{softmax}\bigl(s(X_i, q)\bigr) = \frac{\exp\bigl(s(X_i, q)\bigr)}{\sum_{j=1}^{N}\exp\bigl(s(X_j, q)\bigr)} \tag{11}$$

After the attention distribution $\alpha_i$ is obtained, the weights are applied to the aggregate information $V$ produced by the double cross convolution; the weighted summation of the aggregate information is given by Equation (12):

$$\mathrm{Attention}(X, q) = \sum_{i=1}^{N} \alpha_i V_i \tag{12}$$
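
A minimal numerical sketch of Equations (11) and (12); the dot-product scoring function s(·, ·) is an assumption made for illustration:

```python
import torch
import torch.nn.functional as F

N, dim = 6, 32
X = torch.randn(N, dim)            # N items of input feature information
V = torch.randn(N, dim)            # aggregate information from double cross conv
q = torch.randn(dim)               # target vector

scores = X @ q                     # s(X_i, q), here a simple dot product
alpha = F.softmax(scores, dim=0)   # attention distribution, Equation (11)
out = (alpha.unsqueeze(1) * V).sum(dim=0)  # weighted sum, Equation (12)
print(alpha.sum().item(), out.shape)       # 1.0 torch.Size([32])
```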

Figure 15 compares the network's visualised feature maps and detection results before and after adding the CIA module to the neck. The visualisation clearly shows that, after adding the CIA module, weighted attention modelling of whole-image context and channel information makes the network pay more attention to the features of the target to be detected; it distinguishes image foreground from background more effectively, and position information is also captured more accurately.

Figure 15. Visual feature map comparison.

To verify that the CIA module helps the network detect fuzzy and small targets in night construction site inspection more effectively, the CIA module was compared on the GhostNet network against the traditional SE, CBAM, and CA attention modules; the results are shown in Table 3.

Table 3. Comparison results.

5. Experiments and results

5.1. Experimental environment

The experiments use the PyTorch deep learning framework to build a night-time hoisting operation surface detection model based on the Ghost-YOLOX network. The CPU used is an Intel E5-2620 v4 and the GPU an NVIDIA GeForce GTX 1080 Ti; the system is Ubuntu 20.04, the Python version 3.6, and the CUDA version 11.4. The network parameter settings used in training are shown in Table 4.

Table 4. Parameter settings.

5.2. Dataset introduction

The dataset used in this paper consists of images taken from multiple video frames recorded at real night-time construction sites, covering a rich variety of construction environments. Frames were extracted and screened, targets were annotated with the labelImg tool, and the data were converted to PASCAL VOC format. The final dataset contains 20,436 images, divided into training and test sets at a ratio of 9:1. The dataset contains two categories: "hat" and "person."
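
A hypothetical preparation sketch for such a 9:1 split; the file paths are placeholders for PASCAL VOC image-id lists.

```python
import random

random.seed(0)
with open("ImageSets/Main/all.txt") as f:      # one image id per line
    ids = [line.strip() for line in f if line.strip()]
random.shuffle(ids)
split = int(len(ids) * 0.9)                    # 20436 -> 18392 train, 2044 test
with open("ImageSets/Main/train.txt", "w") as f:
    f.write("\n".join(ids[:split]))
with open("ImageSets/Main/test.txt", "w") as f:
    f.write("\n".join(ids[split:]))
```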

Figure 16. YOLOX training process record.

Figure 17. Ghost-YOLOX training process record.

5.3. Experimental results and analysis

Drawing on the idea of transfer learning, and to speed up network fitting, a model pre-trained on ImageNet is used for training, and the Ghost-YOLOX network parameters are initialised from the weights of the pre-trained model. Network training is divided into two stages: the first 50 epochs are the frozen-training stage, in which the backbone network is frozen, and the last 50 epochs are the global training stage. The loss curves shown in Figures 16 and 17 are drawn from the log files saved during training.
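
A minimal sketch of this two-stage schedule, assuming the model exposes a `backbone` submodule and returns its loss directly; the optimiser settings are placeholders.

```python
import torch

def train(model: torch.nn.Module, loader, total_epochs: int = 100,
          freeze_epochs: int = 50) -> None:
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    for epoch in range(total_epochs):
        frozen = epoch < freeze_epochs          # first 50 epochs: frozen stage
        for p in model.backbone.parameters():
            p.requires_grad = not frozen        # frozen weights get no gradients
        for images, targets in loader:
            loss = model(images, targets)       # assumed to return the total loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```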

Figure 16 shows the training process of the original YOLOX model, and Figure 17 that of Ghost-YOLOX. In the figures, train loss is the loss value during model training and val loss the loss value during validation. During training, the downward trends of train loss and val loss are relatively synchronised, indicating that the model is trained effectively and the training results are sound; after the improvement, the loss value of Ghost-YOLOX is lower and the model converges better.

To optimise the model training process, three optimisers, torch.optim.SGD, torch.optim.RMSprop, and torch.optim.Adam, were used, and comparative experiments were conducted with training parameters matched to each optimiser. The results are shown in Table 5. They show no difference in detection accuracy, FPS, or parameter count between models trained with different optimisers and parameters; there are only minor differences in training time and the convergence process.

Table 5. Parameter comparison experiment.

5.3.1. Comparative test

To verify the advantages of the improved algorithm, Ghost-YOLOX is compared with the current mainstream two-stage network Faster RCNN, the single-stage YOLO series and SSD, and the Transformer-based detector DETR. Each algorithm is trained and validated with the same parameters. Table 6 compares the mAP, parameter count, and FPS of each model.

Table 6. Comparative experimental results.

The comparison shows that, in detection accuracy, the insufficient backbone depth of the SSD and YOLOv4 algorithms leaves a large gap between them and the other general detectors, while Ghost-YOLOX achieves the highest mAP, demonstrating that the channel and spatial attention obtained through the CIA module effectively improves detection of fuzzy and small targets. In parameter count, solving the feature map redundancy caused by nonlinear convolution lets Ghost-YOLOX keep sufficient backbone depth and width while having the fewest parameters. In detection speed, the residual module designed according to the gradient path design strategy saves network computation time, giving Ghost-YOLOX the second-fastest detection speed after SSD; however, weighing detection speed against detection accuracy, the overall performance of Ghost-YOLOX remains better than SSD's. In summary, Ghost-YOLOX achieves a more pronounced balance of detection speed, accuracy, and model size, and is better suited to night-time construction site monitoring than the other algorithms.

5.3.2. Results of ablation experiments

To verify the effectiveness of the improvements in this paper, an ablation experiment was carried out with the original YOLOX-X as the benchmark. The results are shown in Table 7. In the table, GhostNet replaces the backbone and neck of the original YOLOX with those constructed from the Sim-Ghost residual module, YOLOX + CIA adds the CIA module to the neck of the original YOLOX, and Ghost-YOLOX adds the CIA module to the neck of GhostNet.

Table 7. Ablation experiment results.

The ablation results show that after replacing the YOLOX backbone and neck with those constructed from the Sim-Ghost residual module, the number of model parameters falls to 16.3% of YOLOX's and detection speed increases by about 2.5 times, confirming that nonlinear conventional convolution does cause substantial feature map redundancy and that partially replacing it with Ghost convolution effectively saves network computation. Reducing the number of nonlinearly generated feature maps should slightly reduce detection accuracy, yet the average accuracy of GhostNet remains essentially unchanged, confirming that the residual module designed according to the gradient path design strategy improves the network's feature extraction by increasing its shortest and longest gradient paths. When the CIA module is added to the original YOLOX neck, the parameter count increases by 1.6 MB, detection speed falls by 10%, and mAP rises by 2.2%, confirming that the CIA module provides efficient channel attention and spatial attention containing global context, improving the classification and localisation of fuzzy and small targets. After adding the CIA module to GhostNet, the parameter count increases by 1.6 MB, detection speed falls by 7.5%, and mAP rises by 2.24%, confirming that reconstructing the backbone and neck with the Sim-Ghost residual module and adding the CIA module to the neck gives Ghost-YOLOX a more pronounced balance of model size, detection speed, and accuracy.

5.3.3. Industrial deployment experiment

The model in this paper is quantised and deployed with the Python-based model conversion tool rknn-toolkit. First, the model trained in the Ghost-YOLOX training framework is converted to ONNX format, which resolves incompatibilities between different detection training frameworks and quantisation tool versions. The model is then converted to RKNN format by rknn-toolkit and deployed on an industrial embedded host.
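
A hedged sketch of this PyTorch-to-ONNX-to-RKNN pipeline; file names, preprocessing values, and the input resolution are placeholders, and the exact rknn-toolkit options depend on the toolkit version shipped for the RK3399Pro.

```python
import torch
from rknn.api import RKNN

def export_for_rk3399pro(model: torch.nn.Module) -> None:
    model.eval()
    dummy = torch.randn(1, 3, 640, 640)  # assumed input resolution
    torch.onnx.export(model, dummy, "ghost_yolox.onnx", opset_version=11)

    rknn = RKNN()
    rknn.config(channel_mean_value="0 0 0 255", reorder_channel="0 1 2")
    rknn.load_onnx(model="ghost_yolox.onnx")
    # quantise against a text file listing calibration images, one per line
    rknn.build(do_quantization=True, dataset="./quant_images.txt")
    rknn.export_rknn("ghost_yolox.rknn")
    rknn.release()
```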

The embedded host used in this experiment is an RK3399Pro. Its processor combines Cortex-A72 and Cortex-A53 cores with an NPU, runs Android or Linux, and is clocked at 1.8 GHz; the GPU is a Mali-T860MP4, the computing performance reaches 3.0 TOPS, and it supports 1080P video codecs and H.265 hardware decoding.

To verify that Ghost-YOLOX still achieves a more pronounced balance after industrial deployment, a comparative experiment was conducted on the RK3399Pro embedded host against models trained and quantised with the mainstream industrial detection frameworks of the SSD and YOLO series. The results are shown in Table 8.

Table 8. Deployment experiment comparison results.

The deployment results show that the YOLO series models, except YOLOv6, need only a small validation set for calibration and have simple, effective quantisation procedures; beyond the reduction in parameter count, their detection accuracy and speed decline only slightly from the change of model format. The SSD model requires manual configuration of the size and shape of each layer's prior boxes, and its quantisation process provides only a small number of images as a calibration set, so its accuracy drops significantly after quantisation. In summary, the Ghost-YOLOX algorithm is better suited than the other detectors to deployment on night-time construction site monitoring platforms.

5.3.4. Analysis of test results

To verify that Ghost-YOLOX performs well in different scenarios, several representative hard-to-detect image types were selected from the test set. The detection results of the YOLOX and Ghost-YOLOX algorithms are shown in Figures 18, 19, and 20; in each figure, the upper image shows the YOLOX results and the lower image the Ghost-YOLOX results. Figure 18 shows the comparison under complex backgrounds: the complex lighting makes foreground and background hard to distinguish, yet Ghost-YOLOX still classifies and locates targets accurately. Figure 19 shows the comparison on fuzzy targets: because the camera sits on a moving tower crane, captured images often suffer motion blur, yet Ghost-YOLOX still detects the blurred targets accurately. Figure 20 shows the comparison on occluded targets: although the targets are occluded by other objects, Ghost-YOLOX still detects the complete outline of the occluded objects.

Figure 18. Complex background comparison results.

Figure 19. Fuzzy target comparison results.

Figure 20. Occlusion target contrast effect.

In summary, YOLOX misses and falsely detects targets against complex backgrounds, fuzzy targets, and occluded targets, whereas Ghost-YOLOX is more robust in the complex and changeable night-site detection scenarios; the model adapts better and can effectively detect targets that YOLOX finds difficult.

6. Conclusion

In view of the challenging problems faced by night-time construction site monitoring, this paper takes YOLOX as the benchmark algorithm and makes targeted improvements to its network structure; the improved Ghost-YOLOX effectively meets the needs of night construction site monitoring. The Ghost-YOLOX algorithm reconstructs the backbone and neck of YOLOX with the Sim-Ghost module, greatly reducing the parameter count and improving detection speed; adding the CIA module lets the network obtain more efficient channel and spatial attention and improves its detection of fuzzy and small targets. The experimental results show that, compared with the original YOLOX algorithm, Ghost-YOLOX reduces the number of parameters by 83.7%, improves average accuracy by 2.24%, and increases detection speed by 2.3 times. Its computing cost meets the requirements of edge equipment with limited computing power on construction sites. At the same time, it detects workers who are often in dimly lit areas and in motion at night with good accuracy, and can effectively protect workers' lives during night work.

Although Ghost-YOLOX effectively meets the needs of night-time construction site inspection, its improvements all target the difficulties of night-time site images; applied to daytime images, detection accuracy will not improve and may even decrease slightly. The next step of research will focus on making the algorithm more robust without increasing network computing cost, and on simultaneously improving its results in both daytime and night-time monitoring.

Authors' profiles

Han Guijin, male, Associate Professor, Doctor of Engineering; research interests include image processing.

Ruixuan Wang, male, graduate student; research interests include target recognition.

(Corresponding author. Email: [email protected]. Full postal address: West Zone, Chang'an Campus, Xi'an University of Posts and Telecommunications, Xi'an, Shaanxi Province, China.)

Xu Wuyan, male, Master of Engineering; research direction is object recognition.

Li Jun, female, graduate student; research direction is object tracking.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by Shaanxi Provincial Science and Technology Department [grant number: 2023-YBGY-032].

References

  • Bochkovskiy, A., Wang, C. Y., & Liao, H. (2020). YOLOv4: Optimal speed and accuracy of object detection. ArXiv abs/2004.10934 (2020): n. pag.
  • Chen, J. H., Deng, S., Wang, P., Huang, X., & Liu, Y. (2023a). Lightweight helmet detection algorithm using an improved YOLOv4. Sensors, 1256. https://doi.org/10.3390/s23031256
  • Chen, W. M., Li, C., & Guo, H. (2023b). A lightweight face-assisted object detection model for welding helmet use. Expert Systems with Applications, 221.
  • Cui, Z., Zhu, Y., Gu, L., Qi, G., Li, X. X., Gao, P., Zhang, Z. H., & Harada, T. (2022). RestoreDet: Degradation equivariant representation for object detection in low resolution images.
  • Fang, Y., Liao, B., Wang, X., Fang, J., Qi, J., Wu, R., Niu, J. W., & Liu, W. Y. (2021). You only look at one sequence: Rethinking transformer in vision through object detection.
  • Ge, Z., Liu, S., & Wang, F. (2019). YOLOX: Exceeding YOLO series in 2021. IEEE/CVF International Conference on Computer Vision (ICCV), 9626–9635. https://doi.org/10.48550/arXiv.2107.08430
  • Glenn, J. (2020). YOLOv5. https://github.com/ultralytics/yolov5.
  • Gong, A., Li, Z. H., & Liang, C. R. (2023). Real-time pedestrian detection algorithm for adjacent perception in multiple scenes at night. Journal of Image and Graphics, 2693–2705.
  • Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., & Xu, C. (2020). GhostNet: More features from cheap operations. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA (pp. 1577–1586). https://doi.org/10.1109/CVPR42600.2020.00165.
  • Hou, Q., Zhou, D., & Feng, J. (2021). Coordinate attention for efficient mobile network design.
  • Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. https://doi.org/10.48550/arXiv.1704.04861.
  • Hu, J., Shen, L., Albanie, S., Sun, G., & Wu, E. (2017). Squeeze-and-excitation networks. IEEE.
  • Hussain, M. (2023). YOLO-v1 to YOLO-v8, the rise of YOLO and Its complementary nature toward digital manufacturing and industrial defect detection. Machines, 11(7), 677. https://doi.org/10.3390/machines11070677
  • Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. https://doi.org/10.48550/arXiv.1602.07360.
  • Li, C., Li, L., Jiang, H., Weng, K., Geng, Y. F., Li, L., Ke, Z., Li, Q. Y., Chen, M., Nie, W. Q., Li, Y. D., Zhang, B., Liang, L. Y., Xu, X. M., Chu, X. X., Wei, X. M., & Wei, X. L. (2022). YOLOv6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976.
  • Liang, M. F., & Li, L. (2021). Enhanced YOLOv3 fuzzy target detection based on adversarial neural network. Computer Applications and Software, 38(10), 221–228.
  • Liu, J. (2021). Multi-target detection method based on YOLOv4 convolutional neural network. Journal of Physics: Conference Series, 1883(1), 012075. https://doi.org/10.1088/1742-6596/1883/1/012075
  • Minsoo, P., Dai, Q. T., Jinyeong, B., & Seunghee, P. (2023). Small and overlapping worker detection at construction sites. Automation in Construction.
  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: unified, real-time object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), Las Vegas (pp. 779–788). https://doi.org/10.1109/CVPR.2016.91.
  • Redmon, J., & Farhadi, A. (2017). Yolo9000: Better, faster, stronger[C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) (pp. 6517–6525). IEEE.
  • Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv e-prints.
  • Ren, S. Q., He, K. M., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. http://doi.org/10.1109/TPAMI.2016.2577031
  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018), 4510–4520. https://doi.org/10.48550/arXiv.1801.04381.
  • Shi, H. Q., Xie, H., Zhang, M. Y., Jiang, L., & Chen, R. (2023). A low-illumination pedestrian detection algorithm based on attention mechanism. Internet of Things Technology, 13(2), 27–29.
  • Shivanthan, Y., Andy, S., Dyer, A. G., & Song, A. (2018). Saliency preservation in low-resolution grayscale images. In European Conference on Computer Vision (2018) (pp. 237–254).
  • Shu, Z. T., Zhang, Z. B., Song, Y. Z., Wu, M. M., & Yuan, X. B. (2023). Low-illumination image target detection based on improved YOLOv5. Laser & Optoelectronics Progress, 60, 4–8. https://doi.org/10.3788/LOP212965
  • Sunkara, R., & Luo, T. (2023). No more Strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. Machine Learning and Knowledge Discovery in Databases (2022).
  • Tang, K., Chen, L., Zhang, Z. J., & Zhang, S. Y. (2020). Statistical analysis and countermeasures of production safety accidents in China's construction industry. Construction Safety, 40–43.
  • Wang, C. Y., Bochkovskiy, A., & Liao, H. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv e-prints.
  • Wang, J., Chen, Y., Dong, Z., & Gao, M. (2023a). Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Computing and Applications, 35(10), 7853–7865. https://doi.org/10.1007/s00521-022-08077-5
  • Wang, Y. X., Wang, Z., Wu, B., & Yang, G. (2021). A review of safety helmet wear detection algorithms for smart construction sites. Journal of Wuhan University of Technology, 56–62.
  • Wang, Z. J., Cai, Z., & Wu, Y. (2023b). An improved YOLOX approach for low-light and small object detection: PPE on tunnel construction sites. Journal of Computational Design and Engineering, 10(3), 1158–1175. https://doi.org/10.1093/jcde/qwad042
  • Woo, S., Park, J., Lee, J. Y., & Kweon, I. S. (2018). CBAM: Convolutional block attention module. Springer.
  • Yang, J. G., Ren, L., & Sun, S. Q. (2023). A conveyor belt roller detection algorithm for coal transportation at night. Mining Equipment, 56–59.
  • Yang, L. X., Zhang, R.-Y., Li, L., & Xie, X. (2021). Simam: A simple, parameter-free attention module for convolutional neural networks. International Conference on Machine Learning (2021).
  • Zhang, R., Gao, S. B., Zhao, X., & Hou, X. L. (2023). Night-time vehicle target detection algorithm for autonomous driving based on improved YOLOv5s. Electronic Measurement Technology, 1–9. http://kns.cnki.net/kcms/detail/11.2175.TN.20230523.1113.002.html