Abstract
Autonomous robot navigation in unstructured outdoor environments is still a challenging problem, and terrain segmentation is one of the key tasks in robot navigation. Previous methods work well on common terrains like urban roads, but tend to fail in wild conditions due to variations in illumination, weather and road appearance. In this paper, we propose a novel two-branch terrain segmentation network based on the disparity map and ground plane fitting, introducing geometric characteristics into the network. The terrain segmentation main branch uses convolutional feature layers with filters at multiple sampling rates, which effectively considers local and global context information and smooths the holey information in the disparity map. The enhancement branch exploits the plane geometry property of the ground plane deviation map calculated from the disparity map, adaptively generating reference feature maps to improve the robustness of identifying traversable areas on unseen terrains. Experimental results demonstrate excellent performance of the proposed method on terrain segmentation both qualitatively and quantitatively.
1 Introduction
Autonomous robot navigation in unstructured outdoor environments is still an open and challenging problem. Terrain segmentation is one of the core tasks in robot navigation and is key for the robot to identify traversable areas and avoid obstacles. Unlike urban roads with clear marking lines, terrain in unstructured outdoor environments is complicated, featuring various combinations of ground types and obstacles. As illustrated in Fig. 1(a), illumination conditions cause shadows and oversaturation; in addition, obstacles (trees, haystacks) have high visual similarity to a dirt road surface with foliage. These bring great challenges to terrain segmentation.
In the human visual system, stereo disparity plays an important role in scene perception, and it can likewise be adopted by machine vision systems in autonomous robot navigation. Thus many road segmentation algorithms [11, 17, 21, 24] have been developed based on stereo disparity information. For example, Zhu et al. [24] propose a traversable region detection algorithm for indoor and urban environments by introducing UV-disparity. However, inaccurate estimates in the processes of feature extraction and stereo matching tend to produce holey and noisy disparity maps (see Fig. 1(b)). Therefore, it is important to connect global and local information to smooth holes and noise when applying disparity maps.
With the rapid development of deep learning technology, fully convolutional networks [13] are driving advances in semantic segmentation. Many excellent research efforts [1, 2, 6, 12, 14, 23] have improved accuracy on public benchmark datasets such as PASCAL VOC [5], Cityscapes [4] and KITTI Road [7]. They work well on common and regular terrains like urban roads or highways, but may fail in unstructured natural scenes with changing illumination, weather, and road conditions (see Table 1, SegNet [1] and Baseline-RGB). This is because a network trained on a particular kind of dataset is not flexible enough to adapt to different and unseen road conditions. Yadav et al. [20] combine a deep convolutional neural network (CNN) with a color-lines-model-based prior in a conditional random field framework to adapt to varying illumination conditions, but their method fails when the color of the road is close to that of the surrounding environment. In the practical application of robot navigation, the scene will change over time. For these reasons, it is necessary to study an adaptive and robust terrain segmentation algorithm.
The DARPA LAGR program [8] has inspired research efforts [15, 17, 18, 21] focused on unstructured terrain segmentation. Procopio et al. [17] obtain stereo labels by ground plane fitting: they compute the difference between the predicted ground plane disparity and the observed disparity from the stereo readings, and thresholds are applied directly to determine whether pixels in the image belong to the ground. As shown in Fig. 1(c), there are noise and discontinuities in the ground plane deviation map, so the results obtained by threshold segmentation are not reliable. We are inspired by these ground plane fitting techniques [3, 17] and propose a novel method to improve terrain segmentation accuracy.
In this paper, we propose a disparity-based robust unstructured terrain segmentation network. We first perform ground plane fitting and compute the plane deviation, and then use the disparity map and the ground plane deviation map as network inputs instead of the color image. These maps have stable distributions across different datasets and encode plane geometry, which helps to identify traversable areas and avoid obstacles even when the appearance of the road changes greatly. In addition, the segmentation module with filters at multiple sampling rates is a powerful visual model that extracts hierarchies of features and incorporates local and global context, which smooths the holey information in the disparity map. Moreover, the enhancement module adaptively generates reference feature maps to improve the robustness of the terrain segmentation results.
The process of terrain segmentation is shown in Fig. 1. For a given unstructured natural scene, a stereo disparity map is provided by strongly calibrated Point Grey Research Stereo Rigs. A ground plane model is fitted and subtracted out to obtain the ground plane deviation map. The trained two-branch network is applied to terrain scenes with varying appearances and demonstrates excellent terrain segmentation performance.
2 Proposed Methodology
This section is divided into two major parts. We first give the theory of ground plane fitting and the calculation of ground plane deviation, and then we describe the design methodology of the proposed network.
2.1 Ground Plane Fitting and Deviation Calculation
Stereo vision refers to inferring 3D structure from two images taken from different viewpoints. In this research, stereo disparity and depth data are provided by strongly calibrated Point Grey Research Stereo Rigs. There are many different algorithms for estimating a disparity map; in this paper we do not focus on how to estimate it, but on how to effectively apply stereo disparity containing noise and holes.
Disparity provides an important depth and geometric cue for machine vision systems in autonomous robot navigation. Assuming a calibrated stereo camera system with baseline length L and focal length f, and that the X, Y, Z axes of the camera coordinate system are aligned with the image axes x, y and the camera optical axis respectively, the relationship between disparity \(\delta \) and depth d can be expressed as:

\(\delta = \frac{fL}{d}\)
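As a minimal illustration of the disparity-depth relation described above (a sketch, not the authors' implementation; the function name is ours, with f in pixels and L in the same units as the returned depth):

```python
def depth_from_disparity(delta, f, L):
    """Recover depth d from disparity delta via d = f * L / delta,
    for a rectified stereo pair with focal length f (pixels) and
    baseline L (e.g. metres)."""
    if delta <= 0:
        raise ValueError("disparity must be positive")
    return f * L / delta

# e.g. f = 100 px, L = 0.5 m, delta = 10 px  ->  depth 5.0 m
```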
The plane P in a camera coordinate system can be expressed as:

\(AX + BY + CZ + D = 0\)
where A, B, C, D are plane parameters. According to the principle of perspective projection and similarity transformation, we can compute an initial estimate of the plane in disparity space that corresponds to a world plane relative to the camera:

\(\delta (u, v) = \alpha u + \beta v + \gamma \)
where u, v are pixel coordinates in the image coordinate system and \(\alpha , \beta , \gamma \) are plane parameters.
Therefore, given a disparity map, the corresponding ground plane can be fitted to image points that have been stereo matched, without knowing the intrinsic camera parameters. We then obtain the ground plane deviation map \(I_{dev}\) by calculating the difference between the fitted disparity \(\delta _f\) of the ground plane and the given disparity \(\delta _g\):

\(I_{dev} = \delta _f - \delta _g\)
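The fitting and deviation steps can be sketched as follows. This is a plain least-squares version (the helper names are ours); on real, noisy disparities a robust fit such as RANSAC over candidate ground pixels would typically be preferred:

```python
import numpy as np

def fit_ground_plane(u, v, delta):
    """Least-squares fit of delta = alpha*u + beta*v + gamma to
    stereo-matched pixels (u, v) assumed to lie on the ground."""
    A = np.column_stack([u, v, np.ones_like(u)])
    params, *_ = np.linalg.lstsq(A, delta, rcond=None)
    return params  # alpha, beta, gamma

def deviation_map(disparity, params):
    """I_dev = fitted ground-plane disparity minus observed disparity."""
    h, w = disparity.shape
    v, u = np.mgrid[0:h, 0:w]  # pixel coordinate grids
    alpha, beta, gamma = params
    fitted = alpha * u + beta * v + gamma
    return fitted - disparity
```

On an ideal planar ground the deviation is zero; obstacles produce nonzero deviations that the enhancement branch can exploit.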
We conduct a statistical analysis and comparison using natural data taken from outdoor environments. Color images have some similarities in texture and color when the scene remains the same, but differ significantly when the scene changes. In contrast, the distributions of disparity maps and ground plane deviation maps under changing conditions are more stable and similar. Consequently, we choose the disparity map and the ground plane deviation map as network inputs instead of the color image, introducing their plane geometry property into the network.
2.2 Network Architecture
The proposed network architecture consists of two sub-networks: a terrain segmentation main network and a stability and adaptability enhancement module as shown in Fig. 2. The two modules complement each other and show excellent terrain segmentation results.
Segmentation Module. The terrain segmentation main network is designed based on the DeepLab model [2], taking a disparity map as input. In our setting, \(1 \times 1\) convolutional layers are used as dimension reduction modules to remove computational bottlenecks. This design allows adding the stability and adaptability enhancement module without a significant performance penalty. Moreover, we use dilated filters with multiple sampling rates (denoted DConv in Fig. 2), which can be expressed as:

\(H[i] = \sum _{k} F[i + r \cdot k] \, W[k]\)

where F are convolution features, W are the filter weights, r is the sampling rate, and H are the output features. Four dilated convolutions with different sampling rates (2, 4, 6, 8) are applied in parallel as a pyramid structure, which effectively considers local and global context information and smooths the holey and noisy information in the disparity map and ground plane deviation map.
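A minimal 1-D sketch of a dilated (atrous) convolution and the parallel pyramid of sampling rates (2, 4, 6, 8); the actual network applies 2-D dilated convolutions over feature maps, and the function names here are ours:

```python
import numpy as np

def dilated_conv1d(F, W, r):
    """H[i] = sum_k F[i + r*k] * W[k], evaluated only where the
    dilated filter fits entirely inside the signal."""
    K = len(W)
    span = r * (K - 1)  # receptive field grows with the rate r
    return np.array([sum(F[i + r * k] * W[k] for k in range(K))
                     for i in range(len(F) - span)])

def pyramid(F, W, rates=(2, 4, 6, 8)):
    """Parallel dilated convolutions, one branch per sampling rate."""
    return [dilated_conv1d(F, W, r) for r in rates]
```

Larger rates see wider context with the same number of weights, which is how the pyramid mixes local and global information.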
Enhancement Module. The enhancement module operates on the ground plane deviation map, obtained by calculating the difference between the fitted ground plane disparity and the given disparity. As an auxiliary input, the ground plane deviation map can improve the adaptability and robustness of terrain segmentation under changing illumination, weather, and road conditions, because it has stable distributions across different datasets and encodes plane geometry.
The proposed enhancement module is clearly different from the threshold discrimination method used in [21], which feeds the noisy and discontinuous ground reference maps obtained by thresholding directly into the network decoder, resulting in poor segmentation (see the third row of Fig. 5). Instead, we use a \(3 \times 3\) convolutional layer, with half the channel dimension of the corresponding layer, at each scale (1, 1/2, 1/4, 1/8) of the segmentation main network to adaptively generate reference feature maps. These reference feature maps are then concatenated to the corresponding layers of the segmentation module, improving the robustness of the terrain segmentation results.
In our design, overlapping pooling [10] (\(kernel~size = 3, stride = 2\)) is used to shrink the feature maps while retaining representative features. During pooling, the holey information gradually decreases or even disappears. Besides, models trained with overlapping pooling are slightly less prone to overfitting [10]. Overall, the stability and adaptability enhancement module resembles a residual learning module, which enables fast and stable training.
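A 1-D sketch of overlapping max pooling with kernel size 3 and stride 2 (kernel > stride, so adjacent windows share elements; function name ours):

```python
import numpy as np

def overlapping_max_pool1d(x, kernel=3, stride=2):
    """Max pooling where kernel > stride, so consecutive windows
    overlap; output length is (len(x) - kernel) // stride + 1."""
    n_out = (len(x) - kernel) // stride + 1
    return np.array([x[i * stride: i * stride + kernel].max()
                     for i in range(n_out)])

# e.g. [1, 3, 2, 5, 4, 0, 6] -> windows [1,3,2], [2,5,4], [4,0,6]
```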
3 Experiments
3.1 Datasets and Training Details
The natural data sets used here are taken from LAGR program and have been shown to contain time-varying concepts [15, 16]. Representative images are shown in Fig. 3. Generally, three scenarios are considered, and each scenario is associated with two distinct image sequences. The terrains appearing in the six datasets vary greatly and include various combinations of ground types, natural obstacles and man-made obstacles. Illumination conditions range from overcast with good color definition to very sunny, causing shadows and saturation. Each pixel is labeled as one of the three classes: ground plane, obstacle, or unknown.
In this research, stereo disparity and depth data are provided by strongly calibrated Point Grey Research Stereo Rigs. Our implementation is based on the public platform Caffe [9]. We employ the modified version of the 16-layer DeepLab from [2] as the segmentation model, initialized with VGG-16 [19] pretrained on ImageNet. We use the 'poly' learning rate policy with a base learning rate of 0.001 and a power of 0.9. The number of iterations is set to 10000, and no data augmentation strategy is used in our experiments. In addition, the unknown labels are regarded as obstacles to provide a fair comparison with existing methods.
The proposed two-branch terrain segmentation network can run in real time: the average speed is about 14 frames per second on one NVIDIA Titan X graphics card. There is still room for speed improvement by optimizing the network structure or upgrading the hardware.
3.2 Results and Discussion
The performance metric used in this evaluation is the root mean square error (RMSE), where lower scores are better:

\(\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum _{i=1}^{N} (p_i - l_i)^2}\)
where \(p_1, p_2, \ldots , p_N\) are the predictions on [0, 1] for a set of N test points and \(l_1, l_2, \ldots , l_N\) are the corresponding class labels in \(\{0,1\}\). Used in this manner, RMSE measures the error between the predicted terrain class \(p_i\) as output of our network, which is probabilistic, and the actual class label \(l_i\) determined by a human, which is discrete.
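This metric can be computed directly from the probabilistic predictions and the discrete labels; a small sketch (function name ours):

```python
import numpy as np

def rmse(p, l):
    """Root mean square error between probabilistic predictions
    p in [0, 1] and binary class labels l in {0, 1}."""
    p, l = np.asarray(p, float), np.asarray(l, float)
    return np.sqrt(np.mean((p - l) ** 2))
```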
The terrain segmentation main network operates on color images, disparity images and ground plane deviation images respectively, recorded as Baseline-RGB, Baseline-Disparity and Baseline-Deviation. We observe that the proposed terrain segmentation method is more stable (analyzing Table 1 horizontally) and more accurate (analyzing Table 1 vertically) than our baselines. This is because we calculate the ground plane deviation from the disparity, whose stable distribution across datasets and plane geometry property help to identify safe areas when the appearance of the scene changes greatly. Experiments show that using the disparity map and the ground plane deviation map as network inputs instead of the color image is effective. By contrast, when we use a combination of RGB and deviation, the root mean square error increases by \(0.3\%\).
Table 2 shows the comparison of terrain segmentation results. Unlike the training strategy of Wei et al. [21] and others, which uses all six datasets, we choose only the first dataset, DS1A, for training and evaluate performance over all six datasets. Note that DS1A is a small-scale training set and differs significantly from the testing sets, which cover diverse natural and road conditions. Procopio et al. [18] use sample balancing methods and train the model on the near-field data of the image, which does not suffer from the scene change problem. Their method performs well on DS3 (0.104, 0.139), but is not suitable for other datasets, especially DS2B (0.676). We achieve an error of 0.066 in the situation where the scene remains the same. Overall, the proposed two-branch terrain segmentation network based on the disparity map and ground plane fitting is more stable and accurate than the others, owing to the geometric characteristics introduced into the network and the adaptively generated reference feature maps.
In addition, the filters with multiple sampling rates incorporate larger context and smooth the holey information, offering a 1–\(2\%\) decrease in RMSE (see Table 3 for details).
As shown in Fig. 5, the qualitative results show significant visual improvement over other methods. For test data from different datasets, our results are closest to the ground truth and have better smoothness and continuity. Our method performs better than the others in most cases under changing lighting, weather, and variable road conditions, even though we use only the small dataset DS1A for training. To make it easier to see how well ground plane segmentation has been achieved, we show the result images overlapped with the raw images in Fig. 4.
4 Conclusion
In this paper, we propose a novel two-branch terrain segmentation network based on the disparity map and ground plane fitting, in order to produce accurate terrain segmentation results under different illumination, weather and road conditions. It combines the representational power of deep convolutional neural networks with the geometric properties of the disparity map. It not only accurately distinguishes ground areas from obstacles, but also achieves better regional consistency and smoothness. Experimental results demonstrate excellent terrain segmentation performance in variable and challenging scenes.
References
Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015)
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40, 834 (2017)
Chumerin, N., Van Hulle, M.: Ground plane estimation based on dense stereo disparity (2008)
Cordts, M., et al.: The Cityscapes dataset. In: CVPR Workshop on the Future of Datasets in Vision, vol. 1, p. 3 (2015)
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Garcia-Rodriguez, J.: A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857 (2017)
Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013)
Jackel, L.D., Krotkov, E., Perschbacher, M., Pippine, J., Sullivan, C.: The DARPA LAGR program: goals, challenges, methodology, and phase I results. J. Field Robot. 23(11–12), 945–973 (2006)
Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Li, F., Brady, J., Reid, I., Hu, H.: Parallel image processing for object tracking using disparity information. In: Second Asian Conference on Computer Vision ACCV 1995, pp. 762–766 (1995)
Lin, G., Milan, A., Shen, C., Reid, I.: RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1520–1528 (2015)
Procopio, M.J.: An experimental analysis of classifier ensembles for learning drifting concepts over time in autonomous outdoor robot navigation (2007)
Procopio, M.J.: Hand-labeled DARPA LAGR datasets (2007)
Procopio, M.J., Mulligan, J., Grudic, G.: Learning terrain segmentation with classifier ensembles for autonomous robot navigation in unstructured environments. J. Field Robot. 26(2), 145–175 (2009)
Procopio, M.J., Mulligan, J., Grudic, G.: Coping with imbalanced training data for improved terrain prediction in autonomous outdoor robot navigation. In: 2010 IEEE International Conference on Robotics and Automation (ICRA), pp. 518–525. IEEE (2010)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Yadav, S., Patra, S., Arora, C., Banerjee, S.: Deep CNN with color lines model for unmarked road segmentation. In: IEEE International Conference on Image Processing (ICIP 2017), Beijing (2017)
Zhang, W., Chen, Q., Zhang, W., He, X.: Long-range terrain perception using convolutional neural networks. Neurocomputing 275, 781–787 (2018)
Zhang, W., Zhang, W., Li, Z., Gu, J.: Visual features for long-range terrain perception. Robot 3, 015 (2015)
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. arXiv preprint arXiv:1612.01105 (2016)
Zhu, X., Lu, H., Yang, X., Li, Y., Zhang, H.: Stereo vision based traversable region detection for mobile robots using UV-disparity. In: 2013 32nd Chinese Control Conference (CCC), pp. 5785–5790. IEEE (2013)
Acknowledgment
This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grants 61720106005, 61472059 and 61772018. The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.
Zhang, P., Ma, X., Wang, Z., Li, H., Luo, Z. (2018). Disparity-Based Robust Unstructured Terrain Segmentation. In: Lai, JH., et al. Pattern Recognition and Computer Vision. PRCV 2018. Lecture Notes in Computer Science(), vol 11259. Springer, Cham. https://doi.org/10.1007/978-3-030-03341-5_35