Abstract
The traditional feature-based visual SLAM algorithm relies on the static environment assumption when recovering scene structure and camera motion, so dynamic objects in the scene degrade its positioning accuracy. In this paper, we propose to combine deep-learning-based image semantic segmentation with the traditional visual SLAM framework to reduce the interference of dynamic objects on the positioning results. Firstly, a supervised Convolutional Neural Network (CNN) segments the objects in the input image to obtain a semantic image. Secondly, feature points are extracted from the original image, and the feature points belonging to dynamic objects (cars and pedestrians) are eliminated according to the semantic image. Finally, a traditional monocular SLAM method tracks the camera motion using the remaining feature points. Experiments on the Apolloscape dataset show that, compared with the traditional method, the proposed method improves the positioning accuracy in dynamic scenes by about 17%.
1 Introduction
SLAM (simultaneous localization and mapping) is the key technology for autonomous robot operation in unknown environments. Based on the environment data detected by the robot's external sensors, SLAM constructs a map of the surrounding environment and simultaneously provides the robot's position within that map. Compared with ranging instruments such as radar and sonar, visual sensors are small, consume little power, and acquire abundant information, providing rich texture cues about the external environment. Therefore, visual SLAM has become a focus of current research and has been applied to autonomous navigation, VR/AR and other fields.
In recent years, many visual SLAM systems have been developed and show impressive performance on localization and mapping. In 2007, Davison et al. [1] proposed MonoSLAM, which establishes the framework of a probabilistic visual SLAM system and introduces a method for initializing monocular features and estimating feature directions; it is the pioneering work of monocular real-time visual SLAM. In the same year, Klein and Murray [2] proposed PTAM, which uses two threads to separate feature tracking from map building and adopts keyframe-based bundle adjustment for global optimization. Engel et al. [3] proposed LSD-SLAM in 2014, a direct method that uses pixel information rather than feature points to estimate camera motion, optimizing by minimizing the photometric error. The same team published DSO [4] in 2016, one of the most effective direct-method visual odometry systems. In 2015, Mur-Artal et al. [5] proposed ORB-SLAM based on the PTAM framework. It uses ORB features for tracking and matching, and adds automatic initialization, loop closing detection, bag-of-words-based relocalization and back-end optimization. ORB-SLAM is one of the most effective visual SLAM algorithms. The team later proposed ORB-SLAM2 [6], which supports not only monocular cameras but also stereo and RGB-D cameras.
However, all of the above SLAM systems are based on the static environment assumption, i.e., the target scene must remain stationary during processing. Dynamic objects in the scene have a negative effect on positioning accuracy.
At present, traditional feature-based visual SLAM algorithms handle simple dynamic scenes by detecting dynamic points and marking them as outliers. ORB-SLAM reduces the effect of dynamic objects on positioning and mapping accuracy through RANSAC, the chi-square test, keyframes and the local map. Direct methods deal with the occlusion caused by dynamic objects by optimizing the cost function. In 2013, Tan et al. [7] proposed a novel keyframe representation and updating method to adaptively model dynamic environments, in which appearance or structure changes can be effectively detected and handled. In the same year, Zou et al. [8] introduced inter-camera pose estimation and inter-camera mapping to deal with dynamic objects by using multiple cameras in the localization and mapping process. With the development of deep learning, semantic information in images has been explored to improve the performance of SLAM. Chen et al. [9] integrated CNN-based multiple object detection with traditional monocular SLAM to detect moving objects in the scene.
In this paper, we propose a novel method to improve the localization accuracy of feature-based visual SLAM in dynamic scenes. We first apply a deep-learning-based semantic segmentation method to obtain a semantic image. Then an ORB detector extracts feature points, and the dynamic features are eliminated according to the semantic image. Finally, we adapt a traditional feature-based SLAM framework to track the camera motion using the remaining feature points.
2 Approach
2.1 Image Semantic Segmentation
We adopt ICNet [10] (Image Cascade Network) to segment dynamic objects. The network achieves real-time inference with decent results on a single GPU card, so it meets the real-time requirement of SLAM. The overall structure of the network is shown in Fig. 1.
Fig. 1. Network structure of ICNet. Numbers in parentheses are size ratios with respect to the original input image. 'CFF' is the cascade feature fusion unit. There are three branches; the first three layers of the top and middle branches share the same weights. The green layers in the last two branches are lightweight for high efficiency. Only the paths indicated by the black arrows are used in both training and testing. (Color figure online)
Three novel components of the network are as follows:
(1) Cascade Image Input. Classical semantic segmentation networks such as FCN are very time-consuming on high-resolution images. ICNet takes cascade image inputs to overcome this shortcoming. In the top branch, the original image is first downsampled to 1/4 size and fed into PSPNet to obtain a 1/32 sized feature map, which is a coarse prediction that misses many details and boundaries. In the middle and bottom branches, a 1/2 sized image and the original image are used to recover and refine the coarse prediction. Although the prediction of the top branch is coarse, it contains most of the semantic content, so the CNNs of the other two branches, which refine segmentation boundaries and details, can be lightweight. The output feature maps of the different branches are fused by the cascade feature fusion ('CFF') unit, and cascade label guidance enhances the learning procedure in each branch.
(2) Cascade Feature Fusion. The 'CFF' unit combines the output feature maps of different branches; its structure is shown in Fig. 2. The input of the unit consists of two feature maps and a label, where F1 has size \( H_{1} \times W_{1} \times C_{1} \), F2 has size \( H_{2} \times W_{2} \times C_{2} \), and the label has size \( H_{1} \times W_{1} \times 1 \). Upsampling with rate 2 is applied to F1 to reach the same spatial size as F2, and a dilated convolution layer with kernel size \( 3 \times 3 \times C_{3} \) and dilation 2 refines it, so the output of the F1 path has size \( H_{2} \times W_{2} \times C_{3} \). For F2, a convolution with kernel size \( 1 \times 1 \times C_{3} \) is applied to reach the same channel depth. After two batch normalization layers and a 'sum' layer, the fused feature map of size \( H_{2} \times W_{2} \times C_{3} \) is obtained. The label guidance is used to compute the auxiliary loss (a code sketch of this unit is given after Eq. (1) below).
(3) Cascade Label Guidance. As shown in Fig. 1, three ground truth labels at different resolutions (1/16, 1/8 and 1/4 of the original image resolution) are used to compute three independent loss terms, one per branch; this strategy enhances the learning procedure. The total loss function can be expressed as:
$$ L_{total} = - \sum\limits_{t = 1}^{3} \omega_{t} \frac{1}{Y_{t} X_{t}} \sum\limits_{y = 1}^{Y_{t}} \sum\limits_{x = 1}^{X_{t}} \log \frac{e^{F_{\tilde{n},y,x}^{t}}}{\sum\nolimits_{n = 1}^{N} e^{F_{n,y,x}^{t}}} \quad (1) $$
where \( \omega_{t} \) is the loss weight of branch \( t \), the feature map \( F^{t} \) of branch \( t \) has spatial size \( Y_{t} \times X_{t} \), and \( N \) is the number of object categories. \( F_{n,y,x}^{t} \) is the value at position \( \left( {n, y, x} \right) \), and \( \tilde{n} \) is the ground truth label at 2D position \( \left( {y, x} \right) \).
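To make the structure concrete, the following is a minimal PyTorch-style sketch of a CFF unit together with the cascade label guidance loss in Eq. (1). It is an illustration under our own naming, not the authors' released implementation; the channel arguments, the placement of the auxiliary classifier and the branch weights (0.16, 0.4, 1.0) are assumed values.

```python
import torch.nn as nn
import torch.nn.functional as F

class CascadeFeatureFusion(nn.Module):
    """Sketch of the CFF unit (Fig. 2): fuse a low-resolution map F1 with a
    higher-resolution map F2 and emit auxiliary logits for the label guidance."""
    def __init__(self, c1, c2, c_out, num_classes):
        super().__init__()
        # 3x3 dilated convolution (dilation 2) applied to the upsampled F1
        self.conv_low = nn.Conv2d(c1, c_out, 3, padding=2, dilation=2, bias=False)
        self.bn_low = nn.BatchNorm2d(c_out)
        # 1x1 convolution projecting F2 to the same channel depth C3
        self.conv_high = nn.Conv2d(c2, c_out, 1, bias=False)
        self.bn_high = nn.BatchNorm2d(c_out)
        # 1x1 classifier producing this branch's auxiliary prediction
        self.classifier = nn.Conv2d(c1, num_classes, 1)

    def forward(self, f1, f2):
        # upsample F1 by a factor of 2 so it matches the spatial size of F2
        f1_up = F.interpolate(f1, scale_factor=2, mode='bilinear',
                              align_corners=False)
        aux_logits = self.classifier(f1_up)       # compared against the label
        fused = self.bn_low(self.conv_low(f1_up)) + self.bn_high(self.conv_high(f2))
        return F.relu(fused), aux_logits

def cascade_label_loss(branch_logits, branch_labels, weights=(0.16, 0.4, 1.0)):
    """Eq. (1): weighted sum of per-branch pixel-wise softmax cross-entropy.
    cross_entropy already averages over the Y_t * X_t pixels of each branch."""
    return sum(w * F.cross_entropy(logits, labels)
               for logits, labels, w in zip(branch_logits, branch_labels, weights))
```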
In summary, compared with other segmentation networks that produce accurate results but take a long runtime, ICNet achieves real-time semantic segmentation with decent results, which makes it practical for a SLAM system that must operate in real time.
2.2 Feature Points Extraction and Elimination
Traditional feature-based visual SLAM methods such as ORB-SLAM first extract feature points from the original image; the sparse or dense 3D structure of the scene and the camera motion are then recovered from the correspondences of these feature points across frames, under the assumption that the scene is static. In the back-end optimization, RANSAC iterations or the chi-square test are used to eliminate outliers. However, if the scene is too complex, RANSAC and the chi-square test become unreliable [9].
Instead, we eliminate the feature points of dynamic objects directly at the extraction stage using the semantic image. First, we extract ORB feature points from the input image. Then an elimination unit culls the feature points of dynamic objects (cars and pedestrians) based on the segmented image obtained in the semantic segmentation step. The framework is shown in Fig. 3.
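As an illustration, the following sketch extracts ORB features with OpenCV and discards those whose pixel falls on a dynamic class in the semantic image. The class IDs and function names are our own assumptions, since they depend on the label map used to train the segmentation network.

```python
import cv2

# Illustrative class IDs for dynamic objects in the semantic image; the real
# values depend on the label map of the trained segmentation network.
DYNAMIC_IDS = {11, 13}   # e.g. pedestrian and car

def extract_static_features(gray_img, semantic_img, n_features=2000):
    """Extract ORB features and drop those lying on dynamic objects.
    semantic_img must be a per-pixel class map at the same resolution as
    gray_img (resize the ICNet output beforehand if necessary)."""
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints, descriptors = orb.detectAndCompute(gray_img, None)
    if descriptors is None:
        return [], None

    kept_kps, kept_rows = [], []
    for i, kp in enumerate(keypoints):
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        # keep the point only if its pixel is not labelled as a dynamic class
        if int(semantic_img[y, x]) not in DYNAMIC_IDS:
            kept_kps.append(kp)
            kept_rows.append(i)
    return kept_kps, descriptors[kept_rows]
```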
2.3 Camera Motion Tracking
After obtaining the filtered feature points, we track the camera motion to produce the positioning result based on the monocular ORB-SLAM framework, a representative feature-based system. The overall framework is shown in Fig. 4.
A brief introduction of the framework is as follows:
(1) Tracking. The tracking thread estimates the camera motion for every frame and decides when to insert a new keyframe. The initial pose is obtained by feature matching between the current frame and the last frame, and then optimized by motion-only bundle adjustment. Note that the chi-square test is used here to remove mismatches; it has some effect on removing feature matches of dynamic objects, but may fail when many dynamic objects are present (a sketch of this gating appears after this list). Relocalization is applied when tracking is lost. After the initial pose is obtained, matches between feature points of the current frame and the local map are searched by projection, and the pose is optimized again. Finally, the thread decides whether or not to insert a new keyframe.
(2) Local Mapping. The local mapping thread processes keyframes and achieves an optimal sparse reconstruction through local bundle adjustment. Unmatched ORB features in the current keyframe are matched with those of connected keyframes to triangulate new map points. To retain high-quality points and remove redundant keyframes, a strict culling strategy is adopted.
(3) Loop Closing. The loop closing thread searches for loops with each new keyframe. If a loop is detected, a similarity transformation is computed, which reveals the drift accumulated around the loop. The two sides of the loop are then aligned and duplicated points are fused. Finally, to achieve global consistency, a pose graph optimization over similarity constraints is performed.
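For reference, the chi-square gating mentioned in the tracking step can be sketched as follows: an observation is kept only if its squared reprojection error, weighted by the variance of the keypoint's pyramid level, lies below the 95% chi-square threshold for two degrees of freedom (5.991). This is a simplified illustration with our own function names, not code taken from ORB-SLAM.

```python
import numpy as np

CHI2_MONO = 5.991   # 95% quantile of the chi-square distribution with 2 DOF

def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point expressed in the camera frame."""
    x, y, z = p_cam
    return np.array([fx * x / z + cx, fy * y / z + cy])

def is_inlier(p_world, obs_uv, R, t, intrinsics, level_sigma2):
    """Chi-square test on the reprojection error of one observation.

    R, t         : current pose estimate (world -> camera)
    intrinsics   : (fx, fy, cx, cy)
    level_sigma2 : variance associated with the keypoint's pyramid level
    """
    p_cam = R @ p_world + t
    if p_cam[2] <= 0:                       # point behind the camera
        return False
    err = obs_uv - project(p_cam, *intrinsics)
    return float(err @ err) / level_sigma2 < CHI2_MONO
```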
3 Experiment and Analysis
The feasibility and stability of the proposed algorithm in dynamic scenes are verified on the scene parsing part of the public ApolloScape [11] autonomous driving dataset. For a traditional SLAM algorithm based on the static environment assumption, the cars moving through these scenes degrade the robustness and positioning accuracy of the algorithm. The dataset contains multiple image sequences captured in different outdoor road scenes, and each image is paired with high-precision pose information that can be used to evaluate the output of the algorithm. The dataset contains three road scenes: road01, road02 and road03. Each road scene contains multiple segment records, such as Record001 and Record002, and each record contains binocular images. Since our system is monocular, only the left camera images are used. For brevity, the data sequences are abbreviated, e.g., road01\Record067 as r01R067, and similarly for the other sequences.
We adopt monocular ORB-SLAM2 as the SLAM framework. Since ORB-SLAM2 is one of the most outstanding and stable SLAM systems, our experimental results are compared against it.
All experiments are run on a workstation with an Intel Xeon E5-2690V4 at 2.6 GHz, 128 GB of RAM and an NVIDIA Titan V GPU with 12 GB of VRAM.
3.1 Results of Semantic Segmentation
Results of the semantic segmentation are shown in Fig. 5.
Fig. 5. Results of semantic segmentation. The middle column shows that trees, buildings, the road, traffic signs and other objects in the scene are decently segmented. The right column preserves only the segmentation results of the dynamic objects (cars and pedestrians). Although the boundaries are not perfectly accurate, the result is sufficient for feature point elimination.
3.2 Results of Feature Points Elimination
We extract ORB feature points from the input image, then eliminate the feature points of dynamic objects based on the segmented image. Results of the elimination are shown in Fig. 6.
Fig. 6. Results of feature point elimination. The white car is a dynamic object moving on the road. The four images in the left column show the result before elimination; many feature points (green masks) belong to the dynamic car. The right column shows the result after elimination; the car's feature points have been completely culled. The images have been cropped for clearer display. (Color figure online)
3.3 Results of Positioning
Figure 7 shows the plan views of the positioning trajectories of the r01R067 and r02R019 sequences produced by ORB-SLAM2 and by our algorithm. It can be seen that the estimates of both systems largely coincide with the actual trajectory, but our algorithm deviates less from the ground truth than ORB-SLAM2, so our positioning result is more accurate.
Figure 8 shows the errors of the two algorithms in the X, Y and Z directions and the absolute trajectory error (ATE) over time on the r01R067 and r02R019 sequences. Compared with ORB-SLAM2, both the per-axis trajectory errors and the ATE of our algorithm are smaller. At about 82 s into the r01R067 sequence, dynamic cars occupy a large proportion of the scene, so the trajectory error of ORB-SLAM2 rises sharply and its absolute trajectory error reaches 17.7 m.
Table 1 gives the absolute trajectory error statistics of our algorithm and ORB-SLAM2 on eight ApolloScape image sequences. The table shows that our positioning results are better than those of ORB-SLAM2, with the positioning accuracy improved by about 17%.
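Since the trajectory estimated by a monocular system is only determined up to scale, the ATE is computed after aligning the estimate to the ground truth with a least-squares similarity transform (Umeyama alignment). The following is a minimal numpy sketch of this standard evaluation, with our own function names; it is not the exact evaluation script used for Table 1.

```python
import numpy as np

def umeyama_alignment(est, gt):
    """Least-squares similarity transform (s, R, t) mapping est onto gt.
    est, gt: (N, 3) arrays of corresponding camera positions."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    cov = G.T @ E / len(est)                 # cross-covariance of gt and est
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                         # handle the reflection case
    R = U @ S @ Vt
    var_e = (E ** 2).sum() / len(est)        # variance of the estimated track
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) after similarity alignment."""
    s, R, t = umeyama_alignment(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))
```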
4 Conclusion
In this paper, aiming to reduce the negative influence of dynamic objects on traditional feature-based SLAM, we introduce a method that combines deep-learning-based image semantic segmentation with the traditional visual SLAM framework. Experiments on the Apolloscape datasets show that, compared with full ORB-SLAM and incomplete ORB-SLAM, our method improves the positioning accuracy in dynamic scenes by about 13% and 31% respectively. The method still needs improvement: we treat all cars and pedestrians in the scene as dynamic objects during feature point elimination, which wastes useful static information when, for example, a car alternates between driving and stopping. In future work, we will focus on tracking the motion not only of the camera itself but also of objects in the scene, and then apply the feature elimination policy only to objects that are judged to be moving.
References
Davison, A.J., Reid, I.D., Molton, N.D., et al.: MonoSLAM: real-time single camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 1052–1067 (2007)
Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: Proceedings of the Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2007), Nara, Japan. IEEE, November 2007
Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: large-scale direct monocular SLAM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 834–849. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_54
Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 40(3), 611–625 (2016)
Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Rob. 31(5), 1147–1163 (2015)
Mur-Artal, R., Tardos, J.D.: ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Rob. 33(5), 1255–1262 (2017)
Tan, W., Liu, H., Dong, Z., et al.: Robust monocular SLAM in dynamic environments. In: 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE Computer Society (2013)
Zou, D., Tan, P.: CoSLAM: collaborative visual SLAM in dynamic environments. IEEE Trans. Pattern Anal. Mach. Intell. 35(2), 354–366 (2012)
Chen, W., Fang, M., Liu, Y.H., et al.: Monocular semantic SLAM in dynamic street scene based on multiple object tracking. In: IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM), pp. 599–604. IEEE (2017)
Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: ICNet for real-time semantic segmentation on high-resolution images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 418–434. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_25
Huang, X., Cheng, X., Geng, Q., et al.: The apolloscape dataset for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 954–960 (2018)
Acknowledgments
This research was supported by Jiangsu Surveying and Mapping Geographic Information Scientific Research Project (JSCHKY201808), National Key Research and Development Project (2016YFB0502101) and National Natural Science Foundation of China (41574026, 41774027).