Research Article

DOE: a dynamic object elimination scheme based on geometric and semantic constraints

Article: 2293460 | Received 07 May 2023, Accepted 06 Dec 2023, Published online: 14 Dec 2023

Abstract

In this paper, we propose a dynamic object elimination algorithm that combines semantic and geometric constraints to address the problem of visual SLAM being easily affected by dynamic feature points in dynamic environments, an issue that degrades localisation accuracy and robustness. Firstly, we employ a lightweight YOLO-Tiny network to enhance both detection accuracy and system speed. Secondly, we integrate the YOLO-Tiny network into the ORB-SLAM3 system to extract semantic information from the images and perform an initial elimination of dynamic feature points. Subsequently, we augment this approach by incorporating geometric constraints between neighbouring frames to further eliminate dynamic feature points. Experiments on the TUM dataset demonstrate that the proposed algorithm improves the Relative Pose Error (RPE) by up to 95.12% and the Absolute Trajectory Error (ATE) by up to 99.01% in high dynamic sequences compared to ORB-SLAM3. The effectiveness of dynamic feature point elimination is evident, leading to significantly improved localisation accuracy.

1. Introduction

The primary technique for enabling autonomous localisation and mapping in an uncharted area is Simultaneous Localization and Mapping (SLAM) (Alsadik & Karam, Citation2021; Li et al., Citation2021). With SLAM, robots can independently complete navigation, obstacle avoidance, human recognition, alarm detection, and other tasks in dynamic environments (Kang et al., Citation2019; Wu et al., Citation2019; Xia et al., Citation2013). The sensors used in current SLAM systems mainly include vision sensors, lidars, and Inertial Measurement Units (IMUs). Vision sensors include monocular, stereo (binocular), RGB-D, and event cameras, for which open-source solutions such as ORB-SLAM2 (Mur-Artal & Tardós, Citation2017), LSD-SLAM (Engel et al., Citation2014), and SVO (Forster et al., Citation2014) have been developed. Visual SLAM is gradually becoming one of the most active research areas in the SLAM community. Visual SLAM (Sharafutdinov et al., Citation2023; Shen et al., Citation2019; Wan et al., Citation2019) estimates the camera pose from adjacent image frames captured by the camera. Currently, two mainstream types of SLAM systems are employed: direct methods and feature point methods.

The feature point method extracts and detects feature points in each camera frame and obtains the camera's pose by matching and computing feature points between image frames (Hu et al., Citation2020; Li et al., Citation2020; Zhou et al., Citation2012). On the other hand, the direct method estimates camera motion from grayscale intensity variations between images, without extracting or matching feature points. Compared with feature point SLAM systems (Guo et al., Citation2022; Shu et al., Citation2010; Zeng et al., Citation2010), the direct method is more efficient and can work adequately in feature-poor situations, since it does not compute key points or feature descriptors (Wang et al., Citation2020). However, the direct method is limited by the assumption of grayscale invariance and is sensitive to variations in lighting conditions.

Traditional feature point-based approaches are susceptible to a range of problems: a lack of texture leads to insufficient feature points, rapid camera motion leads to feature point mismatches, and sudden changes in illumination lead to state estimation failure. To address these issues, more and more object detection and semantic segmentation techniques are being introduced into SLAM systems. Methods like SegNet (Badrinarayanan et al., Citation2017), Mask R-CNN (He et al., Citation2017), YOLOv3 (Kulkarni et al., Citation2021), and others are frequently used for semantic segmentation and object detection, contributing to improved accuracy in deep learning-based SLAM systems (Fang et al., Citation2014; Lu et al., Citation2020; Wan et al., Citation2014). Liu and Miura (Citation2021) proposed RDS-SLAM, a semantic real-time dynamic visual SLAM system based on ORB-SLAM3, in 2021. It combines the Mask R-CNN and SegNet semantic segmentation algorithms so that the tracking thread does not wait for segmentation results. Its keyframe selection approach efficiently leverages the semantic segmentation results to identify dynamic objects and reduce outliers while preserving the real-time performance of the algorithm. Bescos et al. (Citation2018) built the dynamic scene SLAM system DynaSLAM on the classical ORB-SLAM2 system. This system directly removes the ORB features extracted from moving object areas. In dynamic scenes of monocular, stereo and RGB-D datasets, a multi-view geometry method and the Mask R-CNN semantic segmentation network are combined to recover the background occluded by dynamic objects. However, in some cases, the work presented above can lead to two problems. Firstly, when the scene contains many dynamic elements, directly eliminating all features connected to dynamic objects can leave too few visual localisation points and may lead to trajectory loss, which affects the accuracy of SLAM localisation and mapping. Secondly, objects that are movable but currently static can also appear in the image, leading to unreliable localisation and mapping. It is challenging for conventional methods to extract useful features from dynamic environments (Cheng et al., Citation2019).

Researchers have explored the issue of poor localisation accuracy in SLAM caused by the presence of dynamic objects, and two main approaches have been proposed. The first mainly combines geometric methods with deep learning: a semantic segmentation network retains static feature points while eliminating feature points on dynamic objects. Jiao et al. (Citation2021) propose a probabilistic approach for estimating motion in dynamic environments that accounts for the motion state of feature points and the environment's semantic information. A dynamic object neural network is combined with motion information, and pose estimation is performed using static feature points in static regions, lessening the influence of dynamic objects on visual SLAM. In another paper (Zhang et al., Citation2021), an RGB-D SLAM algorithm is proposed for dynamic scenes, combining the MobileNet and K-Means algorithms in deep learning to eliminate dynamic objects as much as possible. Kang et al. (Citation2021) introduce an object-oriented camera pose tracking and object motion detection method that relies on a pre-constructed object database. The method employs an instance segmentation algorithm that combines both semantic and geometric cues. The obtained results are associated with the object database, which is updated through object motion detection. This approach allows for the identification of dynamic objects in the current frame.

Even though the solutions above perform better at detecting semantic information of objects, the accuracy of localisation using semantic information association and real-time system performance still leave room for improvement and research (Wu et al., Citation2018). As a result, an effective scheme must be established to address these issues (Zhao et al., Citation2019). Some current semantic-based methods rely purely on object detection and semantic segmentation, or combine them with probabilistic methods. However, these methods can only eliminate the effect of dynamic objects on the SLAM system to a certain extent. This paper proposes a method that combines object detection and epipolar geometry to eliminate dynamic features in the scene more effectively. The semantic information provided by object detection initially filters the dynamic information in the scene, while epipolar geometry detects and eliminates dynamic features based on information between image frames. The combination of the two methods compensates for the lack of a priori information in object detection. The scheme can be utilised in robotic systems operating in dynamic environments. For example, in automated warehousing and logistics systems, dynamic SLAM technology can help AGVs (Automated Guided Vehicles) or robots achieve autonomous navigation and motion control, ensuring that they accurately avoid dynamic obstacles such as moving objects or other robots.

1.1. Motivation and contributions

In dynamic environments, existing SLAM systems often exhibit fragility, limiting their application in real-world settings. Therefore, addressing the issue of dynamic objects is crucial for improving the performance and reliability of SLAM systems, enabling them to better adapt to various real-world application scenarios. This paper introduces several innovative points:

  1. A lightweight YOLO-Tiny network is used in the ORB-SLAM3 system to extract semantic information from images and initially eliminate dynamic feature points, improving the detection accuracy and speed of the system.

  2. We propose a Dynamic Object Elimination Algorithm that combines semantic and geometric constraints to detect semantic information about objects in the scene and filter feature points on potentially moving objects. Subsequently, it estimates a more precise inter-frame transformation matrix using the remaining features and leverages epipolar geometry to identify and filter out genuinely dynamic features. This algorithm is then seamlessly integrated into ORB-SLAM3.

The remainder of the paper is structured as follows. Section 2 reviews related work concerning the elimination of dynamic features, discussing previous research and approaches in this area. Section 3 delves into the specifics of our method: we present the details of the dynamic object elimination algorithm (DOE) that combines geometric and semantic constraints, whose primary goal is to enhance the localisation accuracy of SLAM in the presence of dynamic objects. In Section 4, we present the experimental results: we integrate the DOE algorithm into ORB-SLAM3 and evaluate its performance along with the overall robustness of the system. In Section 5, we provide a concise conclusion that summarises the key findings and contributions of the paper.

2. Related work

The advancement of artificial intelligence has promoted the development of robots (Tian et al., Citation2023; Xiong et al., Citation2012; Zeng et al., Citation2010). Numerous social, economic, technical, and organisational issues will arise from rapid urbanisation (Huang et al., Citation2020; Kumar et al., Citation2021; Wang et al., Citation2021, Citation2020). Robots are mainly used in complex environments, where improving the accuracy of their localisation is vital. In complex scenes, the classical framework cannot obtain correct inter-frame matching, resulting in a significant reduction in both localisation and mapping precision. To address this issue, outliers in the odometry portion of the image need to be removed accurately (Shen et al., Citation2019; Zhang et al., Citation2020, Citation2018). Additionally, the triangle observation concept has been used to determine the motion state of points, thereby avoiding the computation of the fundamental matrix (Migliore et al., Citation2009).

In recent years, deep learning technology has made significant advances, offering a fresh approach to object segmentation (Li et al., Citation2020). Pixel-level segmentation of images using deep learning can compensate for the fact that geometric methods cannot easily obtain the whole contour of a moving object. To extract the semantic data of dynamic objects, Xiao et al. (Citation2019) execute SSD in a separate thread. A selective tracking algorithm is utilised in the tracking thread to process the characteristics of dynamic objects, which can significantly lower the pose estimation error. Another approach, proposed by Wang et al. (Citation2019), presents an efficient SLAM for dynamic scenarios based on YOLOv3. The method detects objects using the YOLO network, accurately segments the detected objects' contours using a depth map-based overflow filling algorithm, and constructs a static semantic graph without dynamic objects. However, one limitation of relying solely on semantic information is the inability to determine the motion state of an object.

Numerous studies have tried to improve outlier removal by combining geometric constraints with semantic information. Li and Chen (Citation2022) combined direct feature information and semantic information into a probabilistic model for calculating the probability of moving objects in a monocular camera's field of view. When an object's motion probability exceeds a predetermined threshold, its feature points are eliminated. The advantage of this method is that it can still make reasonable estimates when dynamic objects occupy most of the camera's field of view. Yu et al. (Citation2018) propose the DS-SLAM scheme, in which SegNet is used to semantically segment images and determine outliers. If a feature point is located in the segmentation area of a human body, all feature points within this area are eliminated. DP-SLAM (Li et al., Citation2021) combines geometric models with deep-learning-based algorithms and employs a moving probability propagation model to identify important dynamic locations. Using a Bayesian probability estimation framework, dynamic key points are tracked to filter out those associated with moving objects. DynaSLAM II (Bescos et al., Citation2021) uses instance semantic segmentation and ORB features to track dynamic objects. An advanced 2D instance matching algorithm is proposed for feature matching of dynamic objects, enabling bundle adjustment of the camera, feature points, and dynamic objects together.

However, this method is difficult to run in real time, since the dynamic object mask is computed at the pixel level. Xie et al. (Citation2020) propose a new motion detection and segmentation method based on RGB-D data, introducing an inpainting method for the incomplete masks generated by Mask R-CNN to solve the segmentation problem for actively moving objects. For passively moving objects, the method adopts a motion detection approach based on LK optical flow. DOE-SLAM (Hu & Lang, Citation2021) uses object features and predicted object motion to estimate the camera pose and track the perspective of the moving object. DOT-SLAM (Ballester et al., Citation2021) robustly detects and tracks dynamic objects by combining instance segmentation and multi-view geometric equations. The Blitz-SLAM (Fan et al., Citation2022) system removes noise blocks from the local point cloud by combining the advantages of semantic and geometric information from the mask, RGB, and depth images; the global point cloud is obtained by merging the local point clouds. The PLDS-SLAM (Yuan et al., Citation2023) system proposes a dynamic feature tracking method based on Bayesian theory to eliminate the dynamic noise of points and lines and improve the robustness and accuracy of the SLAM system. It improves the localisation accuracy of the SLAM system but introduces line features, leading to poor real-time performance. Table 1 shows a detailed comparison of the above methods.

Table 1. Comparison of dynamic SLAM methods evaluation.

3. Our proposed DOE scheme

In this section, we begin by introducing the semantic elimination algorithm, followed by a detailed description of our approach. We propose the Dynamic Object Elimination Algorithm (DOE), which combines geometric and semantic constraints. The semantic information provided by object detection initially filters the dynamic information in the scene, while epipolar geometry detects and eliminates dynamic features based on information between image frames. The combination of the two methods compensates for the lack of a priori information in object detection.

3.1. Semantic elimination algorithm

In applications, object detection can provide basic image understanding for SLAM systems and improve SLAM performance through semantic label information, for example by optimising feature point matching and eliminating dynamic feature points. Traditional SLAM provides only basic, low-level information and cannot provide high-level semantic information. In semantic SLAM, deep learning can obtain richer semantic information from images, which can be integrated into traditional SLAM and further applied in mapping and navigation. YOLO is currently one of the most popular object detection techniques; the algorithm improves on OverFeat. YOLOv3-tiny, a simplified version of YOLOv3, retains the overall architecture of YOLOv3 at a much smaller scale, making it lightweight and suitable for deployment on small platforms like embedded systems; however, the reduced network scale may cause a slight loss in detection accuracy. In this paper, we leverage this lightweight network, referred to as YOLO-Tiny, for feature point matching and dynamic feature point elimination in dynamic environments, providing real-time semantic information to the SLAM system. Figure 1(a,b) displays the original RGB images and their corresponding detection results. For indoor scenes, people are considered highly dynamic objects, and dynamic semantic labels are assigned to them as prior information for the SLAM system. The semantic elimination step detects only the dynamic objects defined by this prior information. Feature points carrying the dynamic semantic label are eliminated, and the remaining data are processed in the subsequent algorithm steps and the tracking thread for further analysis and optimisation.
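To make the semantic elimination step concrete, the following is a minimal sketch of filtering ORB feature points against dynamic prior boxes. It assumes detections are already available as (label, box) tuples from any YOLO-style detector; the function name and detection format are illustrative, not the authors' implementation.

```python
import cv2
import numpy as np

DYNAMIC_LABELS = {"person"}  # semantic prior: people are dynamic indoors

def semantic_filter(frame, detections):
    """Drop ORB keypoints that fall inside a dynamic prior box.

    detections: iterable of (label, (x1, y1, x2, y2)) tuples from a
    YOLO-style detector (illustrative format). Returns the keypoints
    and descriptors kept as static, plus the eliminated keypoints.
    """
    orb = cv2.ORB_create(nfeatures=1000)
    keypoints, descriptors = orb.detectAndCompute(frame, None)
    boxes = [box for label, box in detections if label in DYNAMIC_LABELS]

    static_kps, static_desc, dynamic_kps = [], [], []
    for kp, desc in zip(keypoints, descriptors):
        x, y = kp.pt
        in_box = any(x1 <= x <= x2 and y1 <= y <= y2
                     for (x1, y1, x2, y2) in boxes)
        if in_box:
            dynamic_kps.append(kp)   # candidate dynamic point
        else:
            static_kps.append(kp)
            static_desc.append(desc)
    return static_kps, np.array(static_desc), dynamic_kps
```

As Section 3.3 explains, discarding every point inside a prior box is too aggressive; the epipolar check described next recovers static points that merely happen to lie inside a box.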

Figure 1. Original RGB image and result of YOLO-Tiny detection.

3.2. Dynamic object detection and elimination based on epipolar geometric constraints

The green dots in Figure 2 are space points, the blue points are their corresponding imaging points, and the two parallelograms on the left and right are two image frames of the same scene. Figure 2 also depicts the camera observation model. The observation model illustrates that in a static scene, the change in a space point's imaging between different image frames is due solely to the camera's movement. In a dynamic environment, however, the motion of imaged points is influenced not only by the camera's movement but also by the dynamic objects in the scene. The SLAM system can mistakenly attribute the motion of dynamic feature points to the camera's movement, leading to significant errors in the trajectory prediction of a SLAM system that uses dynamic feature points for pose estimation.

Figure 2. Schematic diagram of camera observation.

Epipolar geometry can be used to determine whether a feature is dynamic or static. Figure 3 shows the geometric relationship between two adjacent images: p is a static feature point in the scene, I1 and I2 are two adjacent image frames of the same scene, the optical centres of the two cameras are O1 and O2, and the projections of p on I1 and I2 are P1 and P2. The line O1O2 is the baseline; it intersects I1 and I2 at the epipoles e1 and e2, respectively, and the lines l1 and l2 where the plane pO1O2 intersects the image planes are the epipolar lines on I1 and I2.

Figure 3. Epipolar geometry diagram.

Suppose only P1 in I1 is known, and we want to find the corresponding P2 in I2. As shown in Figure 3, in the absence of depth information, the space point projecting to P1 may be p or p′, but the corresponding projections are constrained to the epipolar lines: P1 lies on l1, and the corresponding P2 must appear on l2. This geometric constraint, which maps a point in one image to the matching epipolar line in the other, is expressed by the fundamental matrix F:
(1) $p_2^T F p_1 = 0$
where p1 and p2 are the coordinates of the space point p in the I1 and I2 camera coordinate systems. Given the point P1 in I1 and the fundamental matrix F, Equation (1) provides the constraint that P2 must satisfy if the mapped point p is static. This constraint can be used to distinguish whether the feature points corresponding to an ORB feature are dynamic. However, the two image points of a static map point may not strictly satisfy Equation (1), because of the inevitable uncertainties in feature extraction and in the estimation of F. Consequently, P2 lies very close to, but not exactly on, the epipolar line specified by the image point P1 and the fundamental matrix F, as seen for p′ in Figure 3. We therefore calculate the distance D between P2 and l2:
(2) $D = \dfrac{|\bar{P}_2^T F \bar{P}_1|}{\sqrt{X^2 + Y^2}}$
where X and Y are the first two components of the epipolar line vector and $\bar{P}_1$, $\bar{P}_2$ are the normalised plane coordinates of P1 and P2. If D is lower than the set threshold of 0.8, the image point is considered static; otherwise, it is considered dynamic. The key to the epipolar geometry check therefore lies in the calculation of the fundamental matrix F, expressed as Equation (3):
(3) $F = \begin{pmatrix} f_1 & f_2 & f_3 \\ f_4 & f_5 & f_6 \\ f_7 & f_8 & f_9 \end{pmatrix}$
The standard methodology is the eight-point method, although F can be calculated from as few as five pairs of feature points. Let $\bar{P}_1 = [u_1, v_1, 1]$ and $\bar{P}_2 = [u_2, v_2, 1]$, where $(u_1, v_1)$ and $(u_2, v_2)$ are the coordinates of P1 and P2. From Equations (1)-(3):
(4) $(u_1, v_1, 1) \begin{pmatrix} f_1 & f_2 & f_3 \\ f_4 & f_5 & f_6 \\ f_7 & f_8 & f_9 \end{pmatrix} \begin{pmatrix} u_2 \\ v_2 \\ 1 \end{pmatrix} = 0$
Let f be the vector collecting the entries of the fundamental matrix F, $f = (f_1, f_2, f_3, f_4, f_5, f_6, f_7, f_8, f_9)^T$. Then Equation (4) can be expressed as:
(5) $(u_1 u_2,\ u_1 v_2,\ u_1,\ v_1 u_2,\ v_1 v_2,\ v_1,\ u_2,\ v_2,\ 1)\, f = 0$
Equation (5) has nine unknowns, but F has only eight degrees of freedom (it is defined up to scale), so the fundamental matrix F can be obtained from eight pairs of matched feature points across the two frames.
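As an illustration of this check, the sketch below estimates F with OpenCV's findFundamentalMat (RANSAC; an eight-point variant, cv2.FM_8POINT, also exists) and then evaluates the point-to-epipolar-line distance of Equation (2) for each match. The fixed threshold of 0.8 follows the text; everything else (names, parameters) is illustrative rather than the authors' code.

```python
import cv2
import numpy as np

def epipolar_dynamic_mask(pts1, pts2, dist_threshold=0.8):
    """Flag matches whose distance to the predicted epipolar line
    exceeds the threshold, i.e. candidate dynamic points.

    pts1, pts2: (N, 2) float arrays of matched pixel coordinates
    in frames I1 and I2. Returns (F, boolean dynamic mask).
    """
    # Robust estimate of F; RANSAC tolerates residual dynamic matches.
    F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                  ransacReprojThreshold=1.0,
                                  confidence=0.99)

    # Homogeneous coordinates (the normalised coordinates of Eq. (2)).
    ones = np.ones((len(pts1), 1))
    p1 = np.hstack([pts1, ones])  # rows are P1-bar transposed
    p2 = np.hstack([pts2, ones])  # rows are P2-bar transposed

    # Epipolar line in I2 for each P1: l2 = F p1 = (X, Y, Z).
    lines = (F @ p1.T).T

    # Eq. (2): D = |P2^T F P1| / sqrt(X^2 + Y^2).
    numerator = np.abs(np.sum(p2 * lines, axis=1))
    denominator = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)
    D = numerator / denominator

    return F, D > dist_threshold
```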

3.3. Dynamic feature point filtering algorithm based on geometric and semantic constraints

Since simple semantic constraints are not accurate enough for localising dynamic objects, this paper proposes a dynamic object elimination algorithm (DOE) combining geometric and semantic constraints, shown in Figure 4. Firstly, ORB features are extracted from the current image frame. Concurrently, in an independent thread, the object detection model performs preliminary dynamic object detection on the same frame. The fundamental matrix is then calculated from matched feature point pairs, and the algorithm employs epipolar geometry to further eliminate dynamic feature points from the object detection results. Finally, the filtered feature points are delivered to the tracking thread of SLAM for processing.

Figure 4. Dynamic feature point filtering algorithm.

Dynamic object detection using YOLO-Tiny provides a semantic prior box rather than the true contour of the dynamic object. If all feature points inside the prior box are removed, the fundamental matrix calculation can be significantly affected, especially in low-dynamic scenarios where many static feature points lie inside the prior box.

The epipolar geometry constraints compensate for this deficiency. The overall algorithm first determines whether a feature point pair lies inside the prior box. If it does, the algorithm further checks whether the pair conforms to the epipolar geometry constraint: if not, the pair is finally determined to be dynamic and eliminated; if it conforms, it may be a static feature point inside the semantic box and is retained. If the pair does not lie inside the prior box, it is judged directly by the epipolar geometry method. The detailed processing strategy for feature point pairs is shown in Algorithm 1, where Bof represents the rectangular region of the prior box, θ the threshold on D, $D_i\ (i = 1, 2, \ldots, N)$ the distance of the i-th feature point from its epipolar line, computed as in Equation (2), and N the number of feature points. The threshold is computed as:
(6) $\theta = \dfrac{1}{N} \sum_{i=1}^{N} e^{D_i}$
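The decision strategy of Algorithm 1 can be sketched as follows, reusing the epipolar distances D computed above. The adaptive threshold follows Equation (6) as reconstructed here (the mean of $e^{D_i}$, one plausible reading of the source); the code illustrates the strategy rather than the authors' exact implementation.

```python
import numpy as np

def doe_filter(pts2, D, prior_boxes):
    """Algorithm 1 (sketch): decide per feature point pair whether to
    keep it, combining the semantic prior box with the epipolar check.

    pts2: (N, 2) matched point coordinates in the current frame.
    D: per-pair epipolar distances from Eq. (2).
    prior_boxes: dynamic prior boxes (x1, y1, x2, y2).
    Returns (keep mask, indices of static points recovered from boxes).
    """
    theta = np.mean(np.exp(D))  # adaptive threshold, Eq. (6) as reconstructed

    keep = np.zeros(len(D), dtype=bool)
    recovered = []  # static points that semantics alone would discard
    for i, ((x, y), d) in enumerate(zip(pts2, D)):
        in_prior = any(x1 <= x <= x2 and y1 <= y <= y2
                       for (x1, y1, x2, y2) in prior_boxes)
        conforms = d <= theta      # satisfies the epipolar constraint
        keep[i] = conforms         # eliminated only if Eq. (2) is violated
        if in_prior and conforms:
            recovered.append(i)    # retained despite the dynamic prior
    return keep, recovered
```

Points outside every prior box are judged by the epipolar test alone; points inside a box are eliminated only when they also violate the constraint, which is what lets the algorithm recover static features, for example on furniture behind a person.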

4. Performance analysis

In this section, we use the TUM dataset for experimental validation. We test our algorithm within the system and conduct ablation experiments to further verify the effectiveness of the DOE algorithm.

4.1. Experimental setting

In this paper, the TUM dataset is selected for verification; the dynamic sequences of the dataset, namely the sitting and walking series, are used. The experimental environment is Ubuntu 20.04 with a 3.6 GHz Intel i7-12700H CPU, 16 GB of 4800 MHz memory, and an Nvidia RTX 3060 GPU.

4.2. Performance of SLAM system localisation accuracy

The primary goal of the DOE algorithm is to remove dynamic feature points from dynamic scenes, leading to improved localisation accuracy for the overall SLAM system. In this section, we evaluate the localisation accuracy of the enhanced SLAM system. It should be pointed out that the localisation accuracy of the improved system does not change noticeably on the sitting series; in this section's experiments, low-dynamic datasets are therefore represented only by the halfsphere sequence, while all high-dynamic (walking series) sequences are tested. The localisation accuracy of the SLAM system is assessed using two metrics: Relative Pose Error (RPE) and Absolute Trajectory Error (ATE). Tables 2 and 3 show each dataset's RPE and ATE evaluation against ORB-SLAM3, where the mean value is the average over multiple runs for each dataset.
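For reference, the usual TUM benchmark formulation of these metrics (stated here as the standard definitions from the benchmark tooling, not quoted from this paper) is as follows. Given estimated poses $P_i \in \mathrm{SE}(3)$ and ground-truth poses $Q_i$ aligned by a rigid transform S,
$\mathrm{ATE}_{\mathrm{RMSE}} = \left( \dfrac{1}{n} \sum_{i=1}^{n} \left\| \operatorname{trans}\!\left( Q_i^{-1} S P_i \right) \right\|^2 \right)^{1/2}$
and, over a fixed time interval Δ,
$E_i = \left( Q_i^{-1} Q_{i+\Delta} \right)^{-1} \left( P_i^{-1} P_{i+\Delta} \right), \qquad \mathrm{RPE}_{\mathrm{RMSE}} = \left( \dfrac{1}{m} \sum_{i=1}^{m} \left\| \operatorname{trans}(E_i) \right\|^2 \right)^{1/2}$
where trans(·) extracts the translational component.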

Table 2. TUM dynamic dataset RPE evaluation.

Table 3. TUM dynamic dataset ATE evaluation.

According to the definitions of ATE and RPE, a smaller RMSE value indicates higher localisation accuracy. As shown in Tables 2 and 3, compared with ORB-SLAM3, our scheme exhibits only a slight optimisation effect on static (low dynamic) datasets, owing to the relatively low number of dynamic feature points in the scene. In highly dynamic scenarios, however, our scheme yields significant optimisation and a noticeable reduction in errors. The effect is most pronounced on the walking (high dynamic) datasets, with RPE improved by up to 95.12% and ATE by up to 99.01%. The DOE algorithm thus accomplishes the original goal of this article: it operates stably in low dynamic environments and effectively removes dynamic feature points in high dynamic environments, thereby improving the localisation accuracy of the SLAM system.
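The improvement percentages quoted here follow the usual convention (an assumption, since the paper does not spell it out): with $e_{\mathrm{orig}}$ the RMSE of ORB-SLAM3 and $e_{\mathrm{ours}}$ the RMSE of the proposed system,
$\mathrm{improvement} = \dfrac{e_{\mathrm{orig}} - e_{\mathrm{ours}}}{e_{\mathrm{orig}}} \times 100\%$
so an ATE improvement of 99.01% means the absolute trajectory error is reduced to roughly 1% of its original value.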

To further visualise the errors, we used the "evaluate.py" tool provided by TUM and plotted the results in Figures 5 to 9. In these figures, the black line represents the ground-truth trajectory, the blue line the predicted trajectory, and the red lines the error between the two. Panels (a) of Figures 5-9 are the test results of ORB-SLAM3 on the different dynamic datasets; the large red areas indicate large errors. Panels (b) are the test results after the DOE algorithm removes dynamic feature points. Compared with the original scheme, the optimisation effect on the low dynamic dataset is not obvious, but the errors on the high dynamic datasets are greatly reduced. The results show that the DOE algorithm makes the system's trajectory prediction more accurate once dynamic feature points are eliminated.

Figure 5. Results before and after removing dynamic objects in fr3_sitting_halfsphere.

Figure 6. Results before and after removing dynamic objects in fr3_walking_xyz.

Figure 7. Results before and after removing dynamic objects in fr3_walking_halfsphere.

Figure 8. Results before and after removing dynamic objects in fr3_walking_static.

Figure 9. Results before and after removing dynamic objects in fr3_walking_rpy.

4.3. Performance of dynamic feature point elimination

In this paper, both low and high dynamic scenes are tested, with dynamic feature points eliminated from ORB-SLAM3 combined with the proposed DOE algorithm. Experimental results for low dynamic scenes are shown in Figure 10, where panels (b), (f), (j), (n) show ORB-SLAM3 running on the low dynamic dataset; panels (c), (g), (k), (o) illustrate the elimination of dynamic features within ORB-SLAM3 using YOLO-Tiny alone; and panels (d), (h), (l), (p) show the result of filtering dynamic feature points with the DOE algorithm, which combines semantic and geometric constraints.

Figure 10. Comparison of low dynamic scene.

4.3.1. Low dynamic scene

Figure 10 shows the sitting series of the TUM dataset, in which two people sit at a desk and talk. The first and second columns show the original RGB images and the ORB-SLAM3 output for each dataset, respectively. The third and fourth columns show the effect of semantic-constraint-based dynamic feature point elimination and of DOE dynamic feature point elimination in the ORB-SLAM3 system, respectively. Each row shows the experimental results on the same dataset: the scene is the same in all four rows, but the camera motion differs across datasets. The two people in the scene make some physical movements, so this is a low dynamic scene. The four sitting sequences are tested separately according to the camera motion.

Figure 10(b,f,j,n) shows that dynamic feature points are not handled in ORB-SLAM3, significantly reducing the accuracy of trajectory prediction; the accuracy tests on each dataset are detailed in the previous section. Figure 10(c,g,k,o) shows the result of detecting dynamic objects in image frames with the YOLO-Tiny object detection network and then eliminating all feature points inside the rectangular box containing the object. Although the two people make only local body movements, people belong to dynamic objects according to the semantic prior information; the object detection network recognises them, and all feature points inside the rectangular box are eliminated.

However, many static feature points are also removed in Figure 10(g,k,o); for example, the feature points on the computer screen on the right side are all deleted. This massive reduction of feature points significantly impacts the accuracy of trajectory prediction. The DOE algorithm makes up for this deficiency of semantic constraints: it adds geometric constraints on top of the semantic constraints and recovers static features wrongly classified as dynamic by the semantic constraints alone. As Figure 10(d,h,l,p) shows, some static features are corrected in the DOE results compared with Figure 10(c,g,k,o).

The analysis demonstrates that YOLO-Tiny can detect objects in a scene quickly, that the dynamic feature point elimination algorithm successfully removes dynamic feature points in SLAM, and that the developed DOE algorithm effectively refines the purely semantic removal of feature points.

4.3.2. High dynamic scene

Figure 11 shows the results in the high dynamic scenes, which belong to the walking series and are likewise divided into four categories according to the camera motion. In the scene, two people walk near a desk, creating a highly dynamic environment. Figure 11(b,f,j,n) shows a large number of detected feature points on the walking bodies when ORB-SLAM3 operates in this environment; as shown in the previous section, these dynamic feature points give the SLAM system poor localisation accuracy or cause outright failure. In Figure 11(c,g,k,o), the object detection network directly detects the people as dynamic, and all feature points on their bodies are filtered out.

Figure 11. Comparison of high dynamic scene.

Some feature points, such as those on the feet, are not eliminated because of errors in the placement of the rectangular bounding boxes given by YOLO-Tiny. The DOE algorithm's filtering results in the highly dynamic environment are shown in Figure 11(d,h,l,p). Compared with Figure 11(c,g,k,o), some static feature points are corrected and retained, while most of the dynamic feature points flagged by the semantic constraints remain filtered out. However, some feature points on the people's hands are still not filtered out in Figure 11(d). The reason is that this body part moves very slowly, producing only small changes between adjacent frames; the computed distance therefore satisfies D < θ, and the dynamic feature point determination fails because D is small.

According to the analysis above, the DOE algorithm effectively eliminates dynamic feature points in the scene, in both low-dynamic and high-dynamic datasets. The drawback is that the bounding box localisation of the YOLO-Tiny network is not accurate enough: due to the reduced scale of the network, some accuracy is sacrificed to increase detection speed and meet the real-time requirements of SLAM.

We conducted ablation experiments on the proposed method. The results in Table 4 show that the system employing both semantic and geometric constraints achieves the smallest error across all sequences. The system relying solely on semantic constraints performs consistently because it excludes the feature points on detected objects, but it exhibits higher error because the motion of dynamic objects still introduces estimation errors. In conclusion, fusing semantic and geometric constraints is essential to achieve a lower error rate for the system.

Table 4. Root mean square error (RMSE) of high dynamic scenes.

To further verify the effectiveness of the proposed algorithm, we conducted a comparative analysis with the point feature extraction of current advanced systems, as shown in Figure 12. We selected two sequences for the comparative study. The experimental results show that in low-dynamic sequences, ORB-SLAM3 is relatively ineffective at handling dynamic feature points; for example, when a person's body movements are small, some feature points on the person are still retained. DynaSLAM and DP-SLAM perform better than ORB-SLAM3, but our algorithm performs best. In highly dynamic sequences, the original ORB-SLAM3 still retains a large number of feature points on dynamic objects, and DynaSLAM and DP-SLAM still retain some feature points on people moving far away: when a moving object is far from the camera, the person changes little between adjacent frames and cannot be determined to be dynamic. In contrast, by combining geometric and semantic constraints, the proposed algorithm achieves the best point feature extraction.

Figure 12. Point feature extraction comparison for other algorithms.

4.3.3. Performance comparison with the state-of-the-art

To verify the effectiveness of the proposed algorithm, we conducted a comparative analysis with other visual SLAM algorithms on various sequences. The experimental results are presented in Table 5.

Table 5. Comparison of different SLAM algorithms.

The data in Table 5 are the RMSE values of the absolute trajectory error. They show that our proposed algorithm achieves results comparable to other SLAM algorithms in highly dynamic scenes and, on certain sequences, is slightly superior to the compared algorithms. These findings provide strong evidence of the effectiveness of our proposed algorithm in dynamic scenes.

5. Conclusion

In this paper, we use the YOLO-Tiny network to detect objects in dynamic scenes and use prior semantic information to identify dynamic objects, building a dynamic feature point elimination method based on semantic constraints. In addition, we propose a dynamic feature point elimination step based on epipolar geometry to compensate for the shortcomings of the semantic constraints. Finally, we combine the two into the DOE algorithm and apply it to ORB-SLAM3 for testing. The experiments confirm that the algorithm dramatically increases the system's localisation accuracy and robustness. In future work, (1) we will consider integrating an inertial measurement unit (IMU) to enhance the system's robustness; (2) since the current semantic detection framework may falsely detect static targets, we aim to select a better semantic segmentation network, such as DeepLab v3+, for semantic detection, in order to reduce dependence on computational resources and further enhance real-time performance.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 61963017; in part by Shanghai Educational Science Research Project, China, under Grant C2022056; in part by Shanghai Science and Technology Program, China, under Grant 23010501000; in part by Humanities and Social Sciences of Ministry of Education Planning Fund, China, under Grant 22YJAZHA145.

References

  • Alsadik, B., & Karam, S. (2021). The simultaneous localization and mapping (SLAM)-an overview. Surveying and Geospatial Engineering Journal, 2(1), 1–12. https://doi.org/10.38094/sgej112
  • Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481–2495. https://doi.org/10.1109/TPAMI.34
  • Ballester, I., Fontán, A., Civera, J., Strobl, K. H., & Triebel, R. (2021). DOT: dynamic object tracking for visual SLAM. In 2021 IEEE international conference on robotics and automation (ICRA) (pp. 11705–11711). IEEE.
  • Bescos, B., Campos, C., Tardós, J. D., & Neira, J. (2021). DynaSLAM II: tightly-coupled multi-object tracking and SLAM. IEEE Robotics and Automation Letters, 6(3), 5191–5198. https://doi.org/10.1109/LRA.2021.3068640
  • Bescos, B., Fácil, J. M., Civera, J., & Neira, J. (2018). DynaSLAM: tracking, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters, 3(4), 4076–4083. https://doi.org/10.1109/LSP.2016.
  • Cheng, H., Xie, Z., Shi, Y., & Xiong, N. (2019). Multi-step data prediction in wireless sensor networks based on one-dimensional CNN and bidirectional LSTM. IEEE Access, 7, 117883–117896. https://doi.org/10.1109/Access.6287639
  • Engel, J., Schöps, T., & Cremers, D. (2014). LSD-SLAM: large-scale direct monocular SLAM. In European conference on computer vision (pp. 834–849). Springer International Publishing.
  • Fan, Y., Zhang, Q., Tang, Y., Liu, S., & Han, H. (2022). Blitz-SLAM: a semantic SLAM in dynamic environments. Pattern Recognition, 121, 108225. https://doi.org/10.1016/j.patcog.2021.108225
  • Fang, W., Li, Y., Zhang, H., Xiong, N., Lai, J., & Vasilakos, A. V. (2014). On the throughput-energy tradeoff for data transmission between cloud and mobile devices. Information Sciences, 283, 79–93. https://doi.org/10.1016/j.ins.2014.06.022
  • Forster, C., Pizzoli, M., & Scaramuzza, D. (2014). SVO: fast semi-direct monocular visual odometry. In 2014 IEEE international conference on robotics and automation (ICRA) (pp. 15–22). IEEE.
  • Guo, X., Wang, Z., Zhu, W., He, G., Deng, H., Lv, C., & Zhang, Z. (2022). Research on DSO vision positioning technology based on binocular stereo panoramic vision system. Defence Technology, 18(4), 593–603. https://doi.org/10.1016/j.dt.2021.12.010
  • He, K., Gkioxari, G., Dollár, P., & Girshick, R.. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969). IEEE.
  • Hu, W., Fan, J., Du, Y., Li, B., Xiong, N., & Bekkering, E. (2020). MDFC–ResNet: an agricultural IoT system to accurately recognize crop diseases. IEEE Access, 8, 115287–115298. https://doi.org/10.1109/Access.6287639
  • Hu, X., & Lang, J. (2021). DOE-SLAM: dynamic object enhanced visual SLAM. Sensors, 21(9), 3091. https://doi.org/10.3390/s21093091
  • Huang, S., Zeng, Z., Ota, K., Dong, M., Wang, T., & Xiong, N. N. (2020). An intelligent collaboration trust interconnections system for mobile information control in ubiquitous 5G networks. IEEE Transactions on Network Science and Engineering, 8(1), 347–365. https://doi.org/10.1109/TNSE.6488902
  • Jiao, J., Wang, C., Li, N., Deng, Z., & Xu, W. (2021). An adaptive visual dynamic-SLAM method based on fusing the semantic information. IEEE Sensors Journal, 22(18), pp. 17414–17420. https://doi.org/10.1109/JSEN.2021.3051691
  • Kang, L., Chen, R. S., Xiong, N., Chen, Y. C., Hu, Y. X., & Chen, C. M. (2019). Selecting hyper-parameters of Gaussian process regression based on non-inertial particle swarm optimization in internet of things. IEEE Access, 7, 59504–59513. https://doi.org/10.1109/Access.6287639
  • Kang, X., Li, J., Fan, X., Jian, H., & Xu, C. (2021). Object-Level semantic map construction for dynamic scenes. Applied Sciences, 11(2), 645. https://doi.org/10.3390/app11020645
  • Kulkarni, M., Junare, P., Deshmukh, M., & Rege, P. P. (2021). Visual SLAM combined with object detection for autonomous indoor navigation using kinect V2 and ROS. In 2021 IEEE 6th international conference on computing, communication and automation (ICCCA) (pp. 478–482). IEEE.
  • Kumar, P., Kumar, R., Srivastava, G., Gupta, G. P., Tripathi, R., Gadekallu, T. R., & Xiong, N. N. (2021). PPSF: A privacy-preserving and secure framework using blockchain-based machine-learning for IoT-driven smart cities. IEEE Transactions on Network Science and Engineering, 8(3), 2326–2341. https://doi.org/10.1109/TNSE.2021.3089435
  • Li, A., Wang, J., Xu, M., & Chen, Z. (2021). DP-SLAM: A visual SLAM with moving probability towards dynamic environments. Information Sciences, 556, 128–142. https://doi.org/10.1016/j.ins.2020.12.019
  • Li, G., Zhang, G., Qin, C., & Lu, A. (2020). Automatic RGBD object segmentation based on MSRM framework integrating depth value. International Journal on Artificial Intelligence Tools, 29(7-8), 2040009. https://doi.org/10.1142/S0218213020400096
  • Li, G. H., & Chen, S. L. (2022). Visual slam in dynamic scenes based on object tracking and static points detection. Journal of Intelligent & Robotic Systems, 104(2), 33. https://doi.org/10.1007/s10846-021-01563-3
  • Li, J., Qiu, M., Zhang, Y., Xiong, N., & Li, Z. (2020). A fast obstacle detection method by fusion of double-layer region growing algorithm and Grid-SECOND detector. IEEE Access, 9, 32053–32063. https://doi.org/10.1109/Access.6287639
  • Li, X., Lin, S., Xu, M., Dai, D., & Wang, J. (2021). PO-SLAM: a novel monocular visual SLAM with points and objects. In 2021 4th International conference on artificial intelligence and big data (ICAIBD) (pp. 454–458). IEEE.
  • Liu, Y., & Miura, J. (2021). RDS-SLAM: real-time dynamic SLAM using semantic segmentation methods. IEEE Access, 9, 23772–23785. https://doi.org/10.1109/Access.6287639
  • Lu, W., Li, L., He, Y., Wei, J., & Xiong, N. N. (2020). RFPS: a robust feature points detection of audio watermarking for against desynchronization attacks in cyber security. IEEE Access, 8, 63643–63653. https://doi.org/10.1109/Access.6287639
  • Migliore, D., Rigamonti, R., Marzorati, D., Matteucci, M., & Sorrenti, D. G. (2009). Use a single camera for simultaneous localization and mapping with mobile object tracking in dynamic environments. In ICRA workshop on safe navigation in open and dynamic environments: application to autonomous vehicles (pp. 12–17). IEEE.
  • Mur-Artal, R., & Tardós, J. D. (2017). ORB-SLAM2: an open-source SLAM system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics, 33(5), 1255–1262. https://doi.org/10.1109/TRO.2017.2705103
  • Sharafutdinov, D., Griguletskii, M., Kopanev, P., Kurenkov, M., Ferrer, G., Burkov, A., & Tsetserukou, D. (2023). Comparison of modern open-source visual SLAM approaches. Journal of Intelligent & Robotic Systems, 107(3), 43. https://doi.org/10.1007/s10846-023-01812-7
  • Shen, X., Yi, B., Liu, H., Zhang, W., Zhang, Z., Liu, S., & Xiong, N. (2019). Deep variational matrix factorization with knowledge embedding for recommendation system. IEEE Transactions on Knowledge and Data Engineering, 33(5), 1906–1918.
  • Shen, Y., Fang, Z., Gao, Y., Xiong, N., Zhong, C., & Tang, X. (2019). Coronary arteries segmentation based on 3D FCN with attention gate and level set function. IEEE Access, 7, 42826–42835. https://doi.org/10.1109/Access.6287639
  • Shu, L., Zhang, Y., Yu, Z., Yang, L. T., Hauswirth, M., & Xiong, N. (2010). Context-aware cross-layer optimized video streaming in wireless multimedia sensor networks. The Journal of Supercomputing, 54, 94–121. https://doi.org/10.1007/s11227-009-0321-6
  • Tian, Y., Chang, Y., Quang, L., Schang, A., Nieto-Granda, C., How, J. P., & Carlone, L (2023). Resilient and distributed multi-robot visual SLAM: datasets, experiments, and lessons learned. Preprint arXiv:2304.04362.
  • Wan, R., Xiong, N., Hu, Q., Wang, H., & Shang, J. (2019). Similarity-aware data aggregation using fuzzy c-means approach for wireless sensor networks. EURASIP Journal on Wireless Communications and Networking, 2019, 1–11. https://doi.org/10.1186/s13638-018-1318-8
  • Wan, Z., Xiong, N., Ghani, N., Vasilakos, A. V., & Zhou, L. (2014). Adaptive unequal protection for wireless video transmission over IEEE 802.11 e networks. Multimedia Tools and Applications, 72, 541–571. https://doi.org/10.1007/s11042-013-1378-z
  • Wang, E., Zhou, Y., & Zhang, Q. (2020). Improved visual odometry based on ssd algorithm in dynamic environment. In 2020 39th Chinese control conference (CCC) (pp. 7475–7480). IEEE.
  • Wang, J., Jin, C., Tang, Q., Xiong, N. N., & Srivastava, G. (2020). Intelligent ubiquitous network accessibility for wireless-powered MEC in UAV-assisted B5G. IEEE Transactions on Network Science and Engineering, 8(4), 2801–2813. https://doi.org/10.1109/TNSE.2020.3029048
  • Wang, Y., Fang, W., Ding, Y., & Xiong, N. (2021). Computation offloading optimization for UAV-assisted mobile edge computing: a deep deterministic policy gradient approach. Wireless Networks, 27(4), 2991–3006. https://doi.org/10.1007/s11276-021-02632-z
  • Wang, Z., Zhang, Q., Li, J., Zhang, S., & Liu, J. (2019). A computationally efficient semantic slam solution for dynamic scenes. Remote Sensing, 11(11), 1363. https://doi.org/10.3390/rs11111363
  • Wu, C., Ju, B., Wu, Y., Lin, X., Xiong, N., Xu, G., & Liang, X. (2019). UAV autonomous target search based on deep reinforcement learning in complex disaster scene. IEEE Access, 7, 117227–117245. https://doi.org/10.1109/Access.6287639
  • Wu, C., Luo, C., Xiong, N., Zhang, W., & Kim, T. H.. (2018). A greedy deep learning method for medical disease analysis. IEEE Access, 6, 20021–20030. https://doi.org/10.1109/ACCESS.2018.2823979
  • Xia, F., Hao, R., Li, J., Xiong, N., Yang, L. T., & Zhang, Y. (2013). Adaptive GTS allocation in IEEE 802.15.4 for real-time wireless sensor networks. Journal of Systems Architecture, 59(10), 1231–1242. https://doi.org/10.1016/j.sysarc.2013.10.007
  • Xiao, L., Wang, J., Qiu, X., Rong, Z., & Zou, X. (2019). Dynamic-SLAM: semantic monocular visual localization and mapping based on deep learning in dynamic environment. Robotics and Autonomous Systems, 117, 1–16. https://doi.org/10.1016/j.robot.2019.03.012
  • Xie, W., Liu, P. X., & Zheng, M. (2020). Moving object segmentation and detection for robust RGBD-SLAM in dynamic environments. IEEE Transactions on Instrumentation and Measurement, 70(1557-9662), 1–8.
  • Xiong, N., Han, W., & Vandenberg, A. (2012). Green cloud computing schemes based on networks: a survey. IET Communications, 6(18), 3294–3300. https://doi.org/10.1049/iet-com.2011.0293
  • Yu, C., Liu, Z., Liu, X., Xie, F., Yang, Y., Wei, Q., & Fei, Q. (2018). DS-SLAM: a semantic visual SLAM towards dynamic environments. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1168–1174). IEEE.
  • Yuan, C., Xu, Y., & Zhou, Q. (2023). PLDS-SLAM: point and line features SLAM in dynamic environment. Remote Sensing, 15(7), 1893. https://doi.org/10.3390/rs15071893
  • Zeng, Y., Sreenan, C. J., Xiong, N., Yang, L. T., & Park, J. H. (2010). Connectivity and coverage maintenance in wireless sensor networks. The Journal of Supercomputing, 52, 23–46. https://doi.org/10.1007/s11227-009-0268-7
  • Zeng, Y., Xiong, N., Park, J. H., & Zheng, G. (2010). An emergency-adaptive routing scheme for wireless sensor networks for building fire hazard monitoring. Sensors, 10(6), 6128–6148. https://doi.org/10.3390/s100606128
  • Zhang, C., Huang, T., Zhang, R., & Yi, X. (2021). PLD-SLAM: a new RGB-D SLAM method with point and line features for indoor dynamic scene. ISPRS International Journal of Geo-Information, 10(3), 163. https://doi.org/10.3390/ijgi10030163
  • Zhang, W., Zhu, S., Tang, J., & Xiong, N. (2018). A novel trust management scheme based on Dempster–Shafer evidence theory for malicious nodes detection in wireless sensor networks. The Journal of Supercomputing, 74, 1779–1801. https://doi.org/10.1007/s11227-017-2150-3
  • Zhang, Y., Li, J., Xing, B., & Hu, X. (2020). Robust semantic optical flow visual odometry in dynamic environment. In 2020 4th Annual international conference on data science and business analytics (ICDSBA) (pp. 324–328). IEEE.
  • Zhao, J., Huang, J., & Xiong, N. (2019). An effective exponential-based trust and reputation evaluation system in wireless sensor networks. IEEE Access, 7, 33859–33869. https://doi.org/10.1109/Access.6287639
  • Zhou, Y., Zhang, Y., Liu, H., Xiong, N., & Vasilakos, A. V. (2012). A bare-metal and asymmetric partitioning approach to client virtualization. IEEE Transactions on Services Computing, 7(1), 40–53. https://doi.org/10.1109/TSC.2012.32