
1 Introduction

Autonomous robot navigation in unstructured outdoor environments is still an open and challenging problem. Terrain segmentation is one of the core tasks in robot navigation: it is key for the robot to identify traversable areas and avoid obstacles. Unlike urban roads with clear lane markings, terrain in unstructured outdoor environments is complicated, characterized by various combinations of ground types and obstacles. As illustrated in Fig. 1(a), illumination conditions cause shadows and oversaturation; in addition, obstacles (trees, haystacks) can be visually very similar to a dirt road surface covered with foliage. These factors make terrain segmentation highly challenging.

In the human visual system, stereo disparity plays an important role in scene perception, and it can likewise be exploited by machine vision systems for autonomous robot navigation. Many road segmentation algorithms [11, 17, 21, 24] have therefore been developed based on stereo disparity information. For example, Zhu et al. [24] propose a traversable region detection algorithm for indoor and urban environments by introducing the UV-disparity. However, inaccurate estimates during feature extraction and stereo matching tend to produce disparity maps with holes and noise (see Fig. 1(b)). It is therefore important to combine global and local information to smooth out these holes and noise when using disparity maps.

Fig. 1.

The process of terrain segmentation: Given an unstructured natural scene (a), a stereo disparity map (b) is provided by strongly calibrated Point Grey Research stereo rigs. A ground plane model is fitted and subtracted out, yielding a ground plane deviation map (c). The proposed two-branch network is applied to obtain the final terrain segmentation result (d).

With the rapid development of deep learning, fully convolutional networks [13] are driving advances in semantic segmentation. Many excellent research efforts [1, 2, 6, 12, 14, 23] have improved accuracy on public standard datasets such as PASCAL VOC [5], Cityscapes [4] and KITTI Road [7]. These methods work well on common, regular terrains such as urban roads or highways, but may fail in unstructured natural scenes with changing illumination, weather and road conditions (see Table 1, SegNet [1] and Baseline-RGB). This is because a network trained on one particular kind of dataset is not flexible enough to adapt to different and unseen road conditions. Shashank et al. [20] combine a deep convolutional neural network (CNN) with a color-lines-model prior in a conditional random field framework to adapt to varying illumination conditions, but their method fails when the color of the road is close to that of the surrounding environment. In practical robot navigation, the scene changes over time. For these reasons, it is necessary to develop an adaptive and robust terrain segmentation algorithm.

The DARPA LAGR program [8] has inspired research efforts [15, 17, 18, 21] that focus on unstructured terrain segmentation. Procopio et al. [17] obtain stereo labels by ground plane fitting: they compute the difference between the predicted ground plane disparity and the observed disparity from the stereo readings, and thresholds are then applied directly to decide whether image pixels belong to the ground. As shown in Fig. 1(c), the ground plane deviation map contains noise and discontinuities, so the results obtained by threshold segmentation are unreliable. Inspired by these ground plane fitting techniques [3, 17], we propose a novel method to improve terrain segmentation accuracy.

In this paper, we propose a disparity-based robust unstructured terrain segmentation network. We first perform ground plane fitting and compute the ground plane deviation, and then use the disparity map and the ground plane deviation map, instead of the color image, as network inputs. These maps have stable distributions across different datasets and encode the plane geometry of the scene, which helps to identify traversable areas and avoid obstacles even when the appearance of the road changes greatly. In addition, the segmentation module with multiple-sampling-rate filters is a powerful visual model that extracts hierarchies of features and incorporates local and global context, which can smooth the holes in the disparity map. Moreover, the enhancement module adaptively generates reference feature maps to improve the robustness of the terrain segmentation results.

The overall process is shown in Fig. 1. For a given unstructured natural scene, a stereo disparity map is provided by strongly calibrated Point Grey Research stereo rigs. A ground plane model is fitted and subtracted out to obtain the ground plane deviation map. The trained two-branch network is then applied to terrain scenes with varying appearances and demonstrates excellent terrain segmentation performance.

2 Proposed Methodology

This section is divided into two major parts. We first present the theory of ground plane fitting and the calculation of the ground plane deviation, and then describe the design of the proposed network.

2.1 Ground Plane Fitting and Deviation Calculation

Stereo vision refers to inferring 3D structure from two images taken from different viewpoints. In this research, stereo disparity and depth data are provided by strongly calibrated Point Grey Research stereo rigs. Many different algorithms exist for estimating the disparity map; in this paper we do not focus on how to estimate it, but on how to effectively use a stereo disparity map that contains noise and holes.

Disparity serves as an important depth and geometric cue for machine vision systems in autonomous robot navigation. Assuming a calibrated stereo camera system with baseline length L and focal length f, and that the X, Y, Z axes of the camera coordinate system are aligned with the image axes x, y and the camera optical axis respectively, the relationship between disparity \(\delta \) and depth d can be expressed as:

$$\begin{aligned} \delta =\frac{L\cdot f}{d} \end{aligned}$$
(1)
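
To make the relation in Eq. (1) concrete, the following minimal Python sketch converts between depth and disparity; the baseline and focal length values are illustrative assumptions, not the calibration of the stereo rig used here.

```python
def disparity_from_depth(depth_m, baseline_m=0.12, focal_px=500.0):
    """Disparity (pixels) for a point at the given depth (metres), per Eq. (1).
    The baseline and focal length are illustrative, not the rig's calibration."""
    return baseline_m * focal_px / depth_m

def depth_from_disparity(disp_px, baseline_m=0.12, focal_px=500.0):
    """Depth (metres) recovered from a disparity value (pixels)."""
    return baseline_m * focal_px / disp_px

if __name__ == "__main__":
    delta = disparity_from_depth(5.0)              # a point 5 m away -> 12 px
    print(delta, depth_from_disparity(delta))      # 12.0 5.0
```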

The plane P in a camera coordinate system can be expressed as:

$$\begin{aligned} P: AX+BY+CZ+D=0 \end{aligned}$$
(2)

where A, B, C, D are plane parameters. According to the principle of perspective projection and similarity transformation, we can compute an initial estimate of the plane in disparity space that corresponds to a world plane relative to the camera:

$$\begin{aligned} \delta =\alpha u+\beta v+\gamma \end{aligned}$$
(3)

where u, v are pixel coordinates in the image coordinate system and \(\alpha , \beta , \gamma \) are plane parameters.

Therefore, given a disparity map, the corresponding ground plane can be fitted to stereo-matched image points without knowing the intrinsic camera parameters. We then obtain the ground plane deviation map \(I_{dev}\) by computing the difference between the given disparity \(\delta _g\) and the fitted ground plane disparity \(\delta _f\):

$$\begin{aligned} I_{dev}=\delta _g-\delta _f \end{aligned}$$
(4)
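
For illustration, the sketch below fits the disparity-space plane of Eq. (3) and computes the deviation map of Eq. (4). It uses an ordinary least-squares fit over all valid pixels; in practice a robust estimator (e.g. RANSAC) restricted to likely ground pixels, as in [3, 17], would be preferred.

```python
import numpy as np

def fit_ground_plane(disparity, valid=None):
    """Ordinary least-squares fit of the disparity-space plane
    delta = alpha*u + beta*v + gamma (Eq. 3).
    `disparity` is an HxW array; `valid` marks pixels with usable disparity.
    Sketch only: a robust fit over likely ground pixels is used in practice."""
    if valid is None:
        valid = disparity > 0                      # treat zero/negative values as holes
    v, u = np.nonzero(valid)                       # pixel coordinates of valid points
    A = np.stack([u, v, np.ones_like(u)], axis=1).astype(np.float64)
    coeffs, *_ = np.linalg.lstsq(A, disparity[valid].astype(np.float64), rcond=None)
    return coeffs                                  # (alpha, beta, gamma)

def deviation_map(disparity, coeffs):
    """Ground plane deviation map I_dev = delta_g - delta_f (Eq. 4)."""
    h, w = disparity.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fitted = coeffs[0] * u + coeffs[1] * v + coeffs[2]
    return disparity - fitted
```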

We conducted a statistical analysis and comparison using natural data taken from outdoor environments. Color images show some similarity in texture and color as long as the scene remains the same, but they differ significantly once the scene changes. In contrast, the distributions of disparity maps and ground plane deviation maps under changing conditions are much more stable and similar to one another. Consequently, we use the disparity map and the ground plane deviation map, instead of the color image, as network inputs, introducing their plane geometry properties into the network.

2.2 Network Architecture

The proposed network architecture consists of two sub-networks, as shown in Fig. 2: a terrain segmentation main network and a stability and adaptability enhancement module. The two modules complement each other and yield excellent terrain segmentation results.

Fig. 2.

The proposed novel terrain segmentation network with the enhancement module. The disparity map and the ground plane deviation map are used as network inputs to introduce their stable distributions and geometric characteristics into the network. The segmentation module with multiple-sampling-rate filters extracts hierarchies of features and incorporates local and global context, which smooths the holes in the disparity map. The enhancement module adaptively generates reference feature maps to improve the robustness of the terrain segmentation results.

Segmentation Module. The terrain segmentation main network is designed based on the DeepLab model [2] and takes a disparity map as input. In our setting, \(1 \times 1\) convolutional layers are used as dimension reduction modules to remove computational bottlenecks. This design allows adding the stability and adaptability enhancement module without a significant performance penalty. Furthermore, we use multi-sampling-rate dilated filters (denoted DConv in Fig. 2), which can be described by the following formula:

$$\begin{aligned} H(x,y)=\sum _{i,j}F(x+i\cdot r,y+j\cdot r) W(i,j) \end{aligned}$$
(5)

where F are the convolution features, W are the filter weights, r is the sampling rate, and H are the output features. Four dilated convolutions with different sampling rates (2, 4, 6, 8) are applied in parallel in a pyramid structure, which effectively incorporates local and global context information to smooth the holes and noise in the disparity map and the ground plane deviation map.
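
As an illustration only, the following NumPy sketch implements Eq. (5) directly for a single-channel feature map and sums the outputs of the parallel dilated convolutions; the fusion of the four branches in the actual network may differ.

```python
import numpy as np

def dilated_conv2d(F, W, rate):
    """Direct implementation of Eq. (5) for a single-channel feature map F and a
    (2k+1)x(2k+1) filter W, using zero padding at the borders."""
    h, w = F.shape
    k = W.shape[0] // 2
    pad = k * rate
    Fp = np.pad(F, pad)                            # zero padding keeps sampled taps in range
    H = np.zeros((h, w), dtype=np.float64)
    for i in range(-k, k + 1):
        for j in range(-k, k + 1):
            # accumulate W(i, j) * F(x + i*rate, y + j*rate) for every output pixel
            H += W[i + k, j + k] * Fp[pad + i * rate: pad + i * rate + h,
                                      pad + j * rate: pad + j * rate + w]
    return H

def multi_rate_pyramid(F, W, rates=(2, 4, 6, 8)):
    """Parallel dilated convolutions with sampling rates 2, 4, 6, 8; the outputs
    are summed here purely for illustration."""
    return sum(dilated_conv2d(F, W, r) for r in rates)
```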

Enhancement Module. The enhancement module operates on the ground plane deviation map, which is obtained by computing the difference between the given disparity and the fitted ground plane disparity. As an auxiliary input, the ground plane deviation map improves the adaptability and robustness of terrain segmentation under changing illumination, weather and road conditions, because it has stable distributions across different datasets and encodes the plane geometry of the scene.

The proposed enhancement module is clearly different from the threshold discrimination method used in [21], which feeds the noisy and discontinuous ground reference maps obtained by thresholding directly into the network decoder, resulting in poor segmentation (see the third row of Fig. 5). Instead, at each corresponding scale of the segmentation main network (1, 1/2, 1/4, 1/8), we apply one \(3 \times 3\) convolutional layer with half the number of channels of that scale to adaptively generate reference feature maps. The reference feature maps are then concatenated to the corresponding layers of the segmentation module, improving the robustness of the terrain segmentation results.

In our design, overlapping pooling [10] (\(kernel~size = 3, stride = 2\)) is used to shrink the feature maps while retaining representative features. During pooling, the holes gradually shrink or even disappear. Besides, models trained with overlapping pooling are slightly less prone to overfitting. Overall, the stability and adaptability enhancement module is similar to a residual learning module, which enables fast and stable training.
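
The following PyTorch-style sketch only illustrates the channel arithmetic of the enhancement branch described above (one 3x3 convolution per scale with half the channels, concatenation with the segmentation features, and overlapping pooling between scales); the channel counts are assumptions, and our actual implementation is in Caffe.

```python
import torch
import torch.nn as nn

class EnhancementModule(nn.Module):
    """Sketch of the enhancement branch: at each scale (1, 1/2, 1/4, 1/8) a 3x3
    convolution produces a reference feature map with half the channels of the
    corresponding segmentation-branch layer, which is concatenated to that layer;
    overlapping pooling (kernel 3, stride 2) moves to the next scale.
    The channel counts below are assumptions, not the published configuration."""

    def __init__(self, seg_channels=(64, 128, 256, 512)):
        super().__init__()
        self.convs = nn.ModuleList()
        in_ch = 1                                   # single-channel deviation map
        for ch in seg_channels:
            self.convs.append(nn.Conv2d(in_ch, ch // 2, kernel_size=3, padding=1))
            in_ch = ch // 2
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)  # overlapping pooling

    def forward(self, deviation_map, seg_features):
        """`seg_features` holds the segmentation-branch maps at scales 1, 1/2, 1/4,
        1/8 (matching spatial sizes assumed); each is returned concatenated with
        its reference feature map."""
        fused, x = [], deviation_map
        for conv, feat in zip(self.convs, seg_features):
            x = conv(x)                             # reference feature map at this scale
            fused.append(torch.cat([feat, x], dim=1))
            x = self.pool(x)                        # shrink to the next scale
        return fused
```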

3 Experiments

3.1 Datasets and Training Details

The natural datasets used here are taken from the LAGR program and have been shown to contain time-varying concepts [15, 16]. Representative images are shown in Fig. 3. Three scenarios are considered, and each scenario is associated with two distinct image sequences. The terrains appearing in the six datasets vary greatly and include various combinations of ground types, natural obstacles and man-made obstacles. Illumination conditions range from overcast with good color definition to very sunny, causing shadows and saturation. Each pixel is labeled as one of three classes: ground plane, obstacle, or unknown.

In this research, stereo disparity and depth data are provided by strongly calibrated Point Grey Research stereo rigs. Our implementation is based on the public Caffe platform [9]. We employ the modified version of the 16-layer DeepLab from [2] as the segmentation model, initialized with VGG-16 [19] pretrained on ImageNet. We use the 'poly' learning rate policy with a base learning rate of 0.001 and a power of 0.9. The number of iterations is set to 10000, and no data augmentation is used in our experiments. In addition, the unknown labels are treated as obstacles to provide a fair comparison with existing methods.
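
For reference, Caffe's 'poly' policy decays the learning rate as base_lr \(\cdot (1 - iter/max\_iter)^{power}\); a small sketch with the settings above:

```python
def poly_lr(iteration, base_lr=0.001, power=0.9, max_iter=10000):
    """Caffe 'poly' policy: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# poly_lr(0) = 0.001, poly_lr(5000) ~= 0.00054, poly_lr(10000) = 0.0
```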

The proposed two-branch terrain segmentation network can run in real time. The average speed is about 14 frames per second on one NVIDIA Titan X graphics card. There is still room for speed improvement by optimizing the network structure or upgrading the hardware.

Fig. 3.

Representative images from each of the datasets

3.2 Results and Discussion

The performance metric used in this evaluation is the root mean square error (RMSE), where lower scores are better.

$$\begin{aligned} RMSE=\sqrt{\frac{1}{N}\sum _{i=1}^N(p_i-l_i)^2} \end{aligned}$$
(6)

where \(p_1, p_2, \ldots , p_N\) are the predictions in [0, 1] for a set of N test points and \(l_1, l_2, \ldots , l_N\) are the corresponding class labels in \(\{0,1\}\). Used in this manner, RMSE measures the error between the predicted terrain class \(p_i\), which is the probabilistic output of our network, and the actual class label \(l_i\) determined by a human annotator, which is discrete.
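
A minimal NumPy sketch of this metric:

```python
import numpy as np

def rmse(predictions, labels):
    """RMSE of Eq. (6) between probabilistic predictions in [0, 1] and
    binary class labels in {0, 1}; lower is better."""
    p = np.asarray(predictions, dtype=np.float64)
    l = np.asarray(labels, dtype=np.float64)
    return np.sqrt(np.mean((p - l) ** 2))

# rmse([0.9, 0.2, 0.7], [1, 0, 1]) ~= 0.216
```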

The terrain segmentation main network is applied to color images, disparity maps and ground plane deviation maps respectively, denoted Baseline-RGB, Baseline-Disparity and Baseline-Deviation. We observe that the proposed terrain segmentation method is more stable (analyzing Table 1 horizontally) and more accurate (analyzing Table 1 vertically) than our baselines. This is because we calculate the ground plane deviation from the disparity, whose stable distribution across different datasets and plane geometry properties help to identify safe areas even when the appearance of the scene changes greatly. The experiments show that using the disparity map and the ground plane deviation map as network inputs instead of the color image is effective. In contrast, when we use a combination of RGB and deviation, the root mean square error increases by \(0.3\%\).

Table 1. Comparison of terrain segmentation results. We use the terrain segmentation main network (segmentation module only) with the multiple-sampling-rate filters, operating on different inputs, as our baselines.

Table 2 compares terrain segmentation results. Unlike the training strategy of Wei et al. [21] and others, which uses all six datasets, we train only on the first dataset, DS1A, and evaluate the performance over all six datasets. Note that DS1A is a small-scale training set and differs significantly from the test sets, which cover diverse natural and road conditions. Procopio et al. [18] use sample balancing methods and train the model on the near-field data of the image, which does not suffer from the scene change problem. Their method performs well on DS3 (0.104, 0.139) but is not suitable for the other datasets, especially DS2B (0.676). We achieve an error of 0.066 when the scene remains the same. Overall, the proposed two-branch terrain segmentation network based on the disparity map and ground plane fitting is more stable and accurate than the others, because it introduces geometric characteristics into the network and generates reference feature maps adaptively.

Table 2. Comparison of terrain segmentation results. Unlike the training strategy of others, which uses all six datasets, we train only on the first dataset DS1A and evaluate the performance over all six datasets.
Table 3. Comparison results of multiple filters
Fig. 4.

Overlay maps (green denotes the traversable areas) of the result images on the raw images (Color figure online)

Fig. 5.

Comparison of terrain segmentation results. The test images come from different datasets (from left to right: DS1B, DS2A, DS2B, DS3A). Green and gray denote traversable areas and obstacles respectively. The penultimate row shows our results, which demonstrate excellent unstructured terrain segmentation performance. (Color figure online)

In addition, the multiple-sampling-rate filters incorporate larger context and smooth the holes, offering a 1–\(2\%\) decrease in RMSE (see Table 3 for details).

As shown in Fig. 5, the qualitative results show significant visual improvement over other methods. For test data from different datasets, our results are closest to the ground truth and have better smoothness and continuity than the others. Our method performs better than the others in most cases under changing lighting, weather and road conditions, even though we only use the small dataset DS1A for training. To make it easier to see how well the ground plane segmentation has been achieved, we overlay the result images on the raw images in Fig. 4.

4 Conclusion

In this paper, we propose a novel two-branch terrain segmentation network based on the disparity map and ground plane fitting, in order to produce accurate terrain segmentation results under different illumination, weather and road conditions. It combines the representational power of deep convolutional neural networks with the geometric properties of the disparity map. It not only accurately distinguishes ground areas from obstacles, but also achieves better regional consistency and smoothness. Experimental results demonstrate excellent terrain segmentation performance in variable and challenging scenes.