Abstract
The traditional feature-based visual SLAM algorithm relies on the static environment assumption when recovering scene structure and camera motion, so dynamic objects in the scene degrade its positioning accuracy. In this paper, we propose to combine deep-learning-based image semantic segmentation with the traditional visual SLAM framework to reduce the interference of dynamic objects on the positioning results. Firstly, a supervised Convolutional Neural Network (CNN) segments the objects in the input image to obtain a semantic image. Secondly, feature points are extracted from the original image, and the feature points belonging to dynamic objects (cars and pedestrians) are eliminated according to the semantic image. Finally, a traditional monocular SLAM method tracks the camera motion using the remaining feature points. Experiments on the Apolloscape dataset show that, compared with the traditional method, the proposed method improves the positioning accuracy in dynamic scenes by about 17%.
1 Introduction
SLAM (simultaneous localization and mapping) is the key technology for autonomous robot operation in unknown environments. Based on the environment data detected by the robot's external sensors, SLAM constructs a map of the surrounding environment and simultaneously provides the robot's position within that map. Compared with ranging instruments such as radar and sonar, visual sensors are small, consume little power, and acquire abundant information, providing rich texture cues about the external environment. Therefore, visual SLAM has become a focus of current research and has been applied to autonomous navigation, VR/AR and other fields.
In recent years, many visual SLAM systems have been developed and show impressive performance on localization and mapping. In 2007, Davison et al. [1] proposed MonoSLAM, which establishes the framework of a probabilistic visual SLAM system and introduces a method for initializing monocular features and estimating feature directions; it is the pioneering work of monocular real-time visual SLAM. In the same year, Klein and Murray [2] proposed PTAM, which uses two threads to separate feature tracking from map building and adopts keyframe-based bundle adjustment for global optimization. Engel et al. [3] proposed LSD-SLAM in 2014, a direct method that uses pixel information rather than feature points to estimate camera motion, optimizing by minimizing the photometric error. The same team published DSO [4] in 2016, one of the most effective direct-method visual odometry systems. In 2015, Mur-Artal et al. [5] proposed ORB-SLAM based on the PTAM framework. It uses ORB features for tracking and matching, and adds automatic initialization, loop closing detection, bag-of-words-based relocalization and back-end optimization. ORB-SLAM is one of the most effective visual SLAM algorithms. The team later proposed ORB-SLAM2 [6], which supports not only monocular cameras but also stereo and RGB-D cameras.
However, all of the above SLAM systems are based on the static environment assumption, i.e., the target scene must remain stationary during processing. Dynamic objects in the scene have a negative effect on positioning accuracy.
At present, traditional feature-based visual SLAM algorithms handle simple dynamic scenes by detecting dynamic points and marking them as outliers. ORB-SLAM reduces the effect of dynamic objects on positioning and mapping accuracy through RANSAC, the chi-square test, keyframes and the local map. Direct methods deal with the occlusion caused by dynamic objects by optimizing the cost function. In 2013, Tan et al. [7] proposed a novel keyframe representation and updating method to adaptively model dynamic environments, in which appearance or structure changes can be effectively detected and handled. In the same year, Zou et al. [8] introduced inter-camera pose estimation and inter-camera mapping to deal with dynamic objects by using multiple cameras in the localization and mapping process. With the development of deep learning, semantic information in images has been explored to improve the performance of SLAM. Chen et al. [9] integrated CNN-based multiple object detection with traditional monocular SLAM to detect moving objects in the scene.
In this paper, we propose a novel method to improve the localization accuracy of feature-based visual SLAM in dynamic scenes. We first apply a deep-learning-based semantic segmentation method to obtain a semantic image. Then an ORB detector extracts feature points, and the dynamic features are eliminated according to the semantic image. Finally, we adapt a traditional feature-based SLAM framework to track the camera motion using the remaining feature points.
2 Approach
2.1 Image Semantic Segmentation
We adopt ICNet [10] (Image Cascade Network) to segment dynamic objects. The network achieves real-time inference with decent results on a single GPU card, so it meets the real-time requirement of SLAM. The overall structure of the network is shown in Fig. 1.
Fig. 1. Network structure of ICNet. Numbers in parentheses are size ratios with respect to the original input image. 'CFF' is the cascade feature fusion unit. There are three branches; the first three layers of the top and middle branches share the same weights. The green layers in the last two branches are lightweight for high efficiency. Only the paths indicated by the black arrows are used in both training and testing. (Color figure online)
Three novel components of the network are as follows:
(1) Cascade Image Input. Classical semantic segmentation networks such as FCN are very time-consuming on high-resolution images. ICNet takes cascade image inputs to overcome this shortcoming. In the top branch, the original image is first downsampled to 1/4 size and fed into PSPNet to obtain a 1/32 sized feature map, which is a coarse prediction that misses many details and boundaries. In the middle and bottom branches, a 1/2 sized image and the original image are used to recover and refine the coarse prediction. Although the prediction of the top branch is coarse, it contains most of the semantic content, so the CNNs of the other two branches, which refine segmentation boundaries and details, can be lightweight. The output feature maps of the different branches are fused by the cascade feature fusion ('CFF') unit, and cascade label guidance enhances the learning procedure in each branch.
(2) Cascade Feature Fusion. The 'CFF' unit combines the output feature maps of different branches; its structure is shown in Fig. 2. The input of the unit consists of two feature maps and a label, where F1 has size \( H_{1} \times W_{1} \times C_{1} \), F2 has size \( H_{2} \times W_{2} \times C_{2} \), and the label has size \( H_{1} \times W_{1} \times 1 \). Upsampling with rate 2 is applied to F1 to reach the same spatial size as F2, and a dilated convolution layer with kernel size \( 3 \times 3 \times C_{3} \) and dilation 2 refines it, so the output of the F1 path has size \( H_{2} \times W_{2} \times C_{3} \). For F2, a convolution with kernel size \( 1 \times 1 \times C_{3} \) is applied to reach the same channel depth. After two batch normalization layers and a 'sum' layer, the fused feature map of size \( H_{2} \times W_{2} \times C_{3} \) is obtained. The label guidance is used to compute the auxiliary loss (a code sketch of this unit is given after Eq. (1) below).
(3) Cascade Label Guidance. As shown in Fig. 1, three ground truth labels at different resolutions (1/16, 1/8 and 1/4 of the original image resolution) are used to compute three independent loss terms, one per branch; this strategy enhances the learning procedure. The total loss function can be expressed as:
$$ L_{total} = - \sum\limits_{t = 1}^{3} \omega_{t} \frac{1}{Y_{t} X_{t}} \sum\limits_{y = 1}^{Y_{t}} \sum\limits_{x = 1}^{X_{t}} \log \frac{e^{F_{\tilde{n},y,x}^{t}}}{\sum\nolimits_{n = 1}^{N} e^{F_{n,y,x}^{t}}} \quad (1) $$
where \( \omega_{t} \) is the loss weight of branch \( t \), the feature map \( F^{t} \) of branch \( t \) has spatial size \( Y_{t} \times X_{t} \), and \( N \) is the number of object categories. \( F_{n,y,x}^{t} \) is the value at position \( \left( {n, y, x} \right) \), and \( \tilde{n} \) is the ground truth label at 2D position \( \left( {y, x} \right) \).
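To make the structure concrete, the following is a minimal PyTorch-style sketch of a CFF unit together with the cascade label guidance loss in Eq. (1). It is an illustration under our own naming, not the authors' released implementation; the channel arguments, the placement of the auxiliary classifier and the branch weights (0.16, 0.4, 1.0) are assumed values.

```python
import torch.nn as nn
import torch.nn.functional as F

class CascadeFeatureFusion(nn.Module):
    """Sketch of the CFF unit (Fig. 2): fuse a low-resolution map F1 with a
    higher-resolution map F2 and emit auxiliary logits for the label guidance."""
    def __init__(self, c1, c2, c_out, num_classes):
        super().__init__()
        # 3x3 dilated convolution (dilation 2) applied to the upsampled F1
        self.conv_low = nn.Conv2d(c1, c_out, 3, padding=2, dilation=2, bias=False)
        self.bn_low = nn.BatchNorm2d(c_out)
        # 1x1 convolution projecting F2 to the same channel depth C3
        self.conv_high = nn.Conv2d(c2, c_out, 1, bias=False)
        self.bn_high = nn.BatchNorm2d(c_out)
        # 1x1 classifier producing this branch's auxiliary prediction
        self.classifier = nn.Conv2d(c1, num_classes, 1)

    def forward(self, f1, f2):
        # upsample F1 by a factor of 2 so it matches the spatial size of F2
        f1_up = F.interpolate(f1, scale_factor=2, mode='bilinear',
                              align_corners=False)
        aux_logits = self.classifier(f1_up)       # compared against the label
        fused = self.bn_low(self.conv_low(f1_up)) + self.bn_high(self.conv_high(f2))
        return F.relu(fused), aux_logits

def cascade_label_loss(branch_logits, branch_labels, weights=(0.16, 0.4, 1.0)):
    """Eq. (1): weighted sum of per-branch pixel-wise softmax cross-entropy.
    cross_entropy already averages over the Y_t * X_t pixels of each branch."""
    return sum(w * F.cross_entropy(logits, labels)
               for logits, labels, w in zip(branch_logits, branch_labels, weights))
```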
In summary, compared with other segmentation networks that produce accurate results but take a long runtime, ICNet achieves real-time semantic segmentation with decent results, which makes it practical for a SLAM system that must operate in real time.
2.2 Feature Points Extraction and Elimination
Traditional feature-based visual SLAM methods such as ORB-SLAM first extract feature points from the original image; the sparse or dense 3D structure of the scene and the camera motion are then recovered from the correspondences of these feature points across frames, under the assumption that the scene is static. In the back-end optimization, RANSAC iterations or the chi-square test are used to eliminate outliers. However, if the scene is too complex, RANSAC and the chi-square test become unreliable [9].
Instead, we eliminate the feature points of dynamic objects directly at the extraction stage using the semantic image. First, we extract ORB feature points from the input image. Then an elimination unit culls the feature points of dynamic objects (cars and pedestrians) based on the segmented image obtained in the semantic segmentation step. The framework is shown in Fig. 3.
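As an illustration, the following sketch extracts ORB features with OpenCV and discards those whose pixel falls on a dynamic class in the semantic image. The class IDs and function names are our own assumptions, since they depend on the label map used to train the segmentation network.

```python
import cv2

# Illustrative class IDs for dynamic objects in the semantic image; the real
# values depend on the label map of the trained segmentation network.
DYNAMIC_IDS = {11, 13}   # e.g. pedestrian and car

def extract_static_features(gray_img, semantic_img, n_features=2000):
    """Extract ORB features and drop those lying on dynamic objects.
    semantic_img must be a per-pixel class map at the same resolution as
    gray_img (resize the ICNet output beforehand if necessary)."""
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints, descriptors = orb.detectAndCompute(gray_img, None)
    if descriptors is None:
        return [], None

    kept_kps, kept_rows = [], []
    for i, kp in enumerate(keypoints):
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        # keep the point only if its pixel is not labelled as a dynamic class
        if int(semantic_img[y, x]) not in DYNAMIC_IDS:
            kept_kps.append(kp)
            kept_rows.append(i)
    return kept_kps, descriptors[kept_rows]
```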
2.3 Camera Motion Tracking
After obtaining the filtered feature points, we track the camera motion to produce the positioning result based on the monocular ORB-SLAM framework, a representative feature-based system. The overall framework is shown in Fig. 4.
A brief introduction of the framework is as follows:
(1) Tracking. The tracking thread estimates the camera motion for every frame and decides when to insert a new keyframe. The initial pose is obtained by feature matching between the current frame and the last frame, and then optimized by motion-only bundle adjustment. Note that the chi-square test is used here to remove mismatches; it has some effect on removing feature matches of dynamic objects, but may fail when many dynamic objects are present (a sketch of this gating appears after this list). Relocalization is applied when tracking is lost. After the initial pose is obtained, matches between feature points of the current frame and the local map are searched by projection, and the pose is optimized again. Finally, the thread decides whether or not to insert a new keyframe.
(2) Local Mapping. The local mapping thread processes keyframes and achieves an optimal sparse reconstruction through local bundle adjustment. Unmatched ORB features in the current keyframe are matched with those of connected keyframes to triangulate new map points. To retain high-quality points and remove redundant keyframes, a strict culling strategy is adopted.
(3) Loop Closing. The loop closing thread searches for loops with each new keyframe. If a loop is detected, a similarity transformation is computed, which reveals the drift accumulated around the loop. The two sides of the loop are then aligned and duplicated points are fused. Finally, to achieve global consistency, a pose graph optimization over similarity constraints is performed.
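For reference, the chi-square gating mentioned in the tracking step can be sketched as follows: an observation is kept only if its squared reprojection error, weighted by the variance of the keypoint's pyramid level, lies below the 95% chi-square threshold for two degrees of freedom (5.991). This is a simplified illustration with our own function names, not code taken from ORB-SLAM.

```python
import numpy as np

CHI2_MONO = 5.991   # 95% quantile of the chi-square distribution with 2 DOF

def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point expressed in the camera frame."""
    x, y, z = p_cam
    return np.array([fx * x / z + cx, fy * y / z + cy])

def is_inlier(p_world, obs_uv, R, t, intrinsics, level_sigma2):
    """Chi-square test on the reprojection error of one observation.

    R, t         : current pose estimate (world -> camera)
    intrinsics   : (fx, fy, cx, cy)
    level_sigma2 : variance associated with the keypoint's pyramid level
    """
    p_cam = R @ p_world + t
    if p_cam[2] <= 0:                       # point behind the camera
        return False
    err = obs_uv - project(p_cam, *intrinsics)
    return float(err @ err) / level_sigma2 < CHI2_MONO
```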
3 Experiment and Analysis
The feasibility and stability of the proposed algorithm in dynamic scenes are verified on the scene parsing part of the public ApolloScape [11] autonomous driving dataset. For a traditional SLAM algorithm based on the static environment assumption, the cars moving through these scenes degrade the robustness and positioning accuracy of the algorithm. The dataset contains multiple image sequences captured in different outdoor road scenes, and each image is paired with high-precision pose information that can be used to evaluate the output of the algorithm. The dataset contains three road scenes: road01, road02 and road03. Each road scene contains multiple segment records, such as Record001 and Record002, and each record contains binocular images. Since our system is monocular, only the left camera images are used. For brevity, the data sequences are abbreviated, e.g., road01\Record067 as r01R067, and similarly for the other sequences.
We adopt monocular ORB-SLAM2 as the SLAM framework. Since ORB-SLAM2 is one of the most outstanding and stable SLAM systems, our experimental results are compared against it.
All experiments are run on a workstation with an Intel Xeon E5-2690V4 at 2.6 GHz, 128 GB of RAM and an NVIDIA Titan V GPU with 12 GB of VRAM.
3.1 Results of Semantic Segmentation
Results of the semantic segmentation are shown in Fig. 5.
Fig. 5. Results of semantic segmentation. The middle column shows that trees, buildings, the road, traffic signs and other objects in the scene are decently segmented. The right column preserves only the segmentation results of the dynamic objects (cars and pedestrians). Although the boundaries are not perfectly accurate, the result is sufficient for feature point elimination.
3.2 Results of Feature Points Elimination
We extract ORB feature points from the input image, then eliminate the feature points of dynamic objects based on the segmented image. Results of the elimination are shown in Fig. 6.
Fig. 6. Results of feature point elimination. The white car is a dynamic object moving on the road. The four images in the left column show the result before elimination; many feature points (green masks) belong to the dynamic car. The right column shows the result after elimination; the car's feature points have been completely culled. The images have been cropped for clearer display. (Color figure online)
3.3 Results of Positioning
Figure 7 shows the plan views of the positioning trajectories of the r01R067 and r02R019 sequences produced by ORB-SLAM2 and by our algorithm. It can be seen that the estimates of both systems largely coincide with the actual trajectory, but our algorithm deviates less from the ground truth than ORB-SLAM2, so our positioning result is more accurate.
Figure 8 shows the errors of the two algorithms in the X, Y and Z directions and the absolute trajectory error (ATE) over time on the r01R067 and r02R019 sequences. Compared with ORB-SLAM2, both the per-axis trajectory errors and the ATE of our algorithm are smaller. At about 82 s into the r01R067 sequence, dynamic cars occupy a large proportion of the scene, so the trajectory error of ORB-SLAM2 rises sharply and its absolute trajectory error reaches 17.7 m.
Table 1 gives the absolute trajectory error statistics of our algorithm and ORB-SLAM2 on eight ApolloScape image sequences. The table shows that our positioning results are better than those of ORB-SLAM2, with the positioning accuracy improved by about 17%.
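Since the trajectory estimated by a monocular system is only determined up to scale, the ATE is computed after aligning the estimate to the ground truth with a least-squares similarity transform (Umeyama alignment). The following is a minimal numpy sketch of this standard evaluation, with our own function names; it is not the exact evaluation script used for Table 1.

```python
import numpy as np

def umeyama_alignment(est, gt):
    """Least-squares similarity transform (s, R, t) mapping est onto gt.
    est, gt: (N, 3) arrays of corresponding camera positions."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    cov = G.T @ E / len(est)                 # cross-covariance of gt and est
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                         # handle the reflection case
    R = U @ S @ Vt
    var_e = (E ** 2).sum() / len(est)        # variance of the estimated track
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) after similarity alignment."""
    s, R, t = umeyama_alignment(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))
```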
4 Conclusion
In this paper, aiming to reduce the negative influence of dynamic objects on traditional feature-based SLAM, we introduce a method that combines deep-learning-based image semantic segmentation with the traditional visual SLAM framework. Experiments on the Apolloscape datasets show that, compared with full ORB-SLAM and incomplete ORB-SLAM, our method improves the positioning accuracy in dynamic scenes by about 13% and 31% respectively. The method still needs improvement: we treat all cars and pedestrians in the scene as dynamic objects during feature point elimination, which wastes useful static information when, for example, a car alternates between driving and stopping. In future work, we will focus on tracking the motion not only of the camera itself but also of objects in the scene, and then apply the feature elimination policy only to objects that are judged to be moving.
References
Davison, A.J., Reid, I.D., Molton, N.D., et al.: MonoSLAM: real-time single camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 1052–1067 (2007)
Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: Proceedings of the Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2007), Nara, Japan. IEEE, November 2007
Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: large-scale direct monocular SLAM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 834–849. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_54
Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 40(3), 611–625 (2016)
Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Rob. 31(5), 1147–1163 (2015)
Mur-Artal, R., Tardos, J.D.: ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Rob. 33(5), 1255–1262 (2017)
Tan, W., Liu, H., Dong, Z., et al.: Robust monocular SLAM in dynamic environments. In: 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE Computer Society (2013)
Zou, D., Tan, P.: CoSLAM: collaborative visual SLAM in dynamic environments. IEEE Trans. Pattern Anal. Mach. Intell. 35(2), 354–366 (2012)
Chen, W., Fang, M., Liu, Y.H., et al.: Monocular semantic SLAM in dynamic street scene based on multiple object tracking. In: IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM), pp. 599–604. IEEE (2017)
Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: ICNet for real-time semantic segmentation on high-resolution images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 418–434. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_25
Huang, X., Cheng, X., Geng, Q., et al.: The apolloscape dataset for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 954–960 (2018)
Acknowledgments
This research was supported by Jiangsu Surveying and Mapping Geographic Information Scientific Research Project (JSCHKY201808), National Key Research and Development Project (2016YFB0502101) and National Natural Science Foundation of China (41574026, 41774027).