1 Introduction

Navigation in highly unstructured environments such as forest or mountain trails is extremely challenging for autonomous robots due to the numerous variations present in natural environments and the absence of structured pathways or distinct lane markings. A robot capable of autonomously navigating off-road environments would be an invaluable aid in applications such as search-and-rescue missions and wilderness monitoring and mapping.

The resurgence of neural networks in the form of “deep” neural networks (DNNs) [1] has improved the performance of various high-level computer vision tasks [2,3,4,5]. DNNs have also been successfully applied to road and lane detection on highways and in structured urban settings [6,7,8]. The task of trail detection is related to road (lane) detection; however, the two are fundamentally different. Unlike structured urban roads, natural environments exhibit enormous variation, and the absence of structured pathways or distinct lane markings makes trail detection more challenging.

Off-road trail detection has primarily been approached as a segmentation problem [9,10,11] using classical computer vision techniques. Neural networks (NNs) have also been used for autonomous navigation in unstructured natural environments [12,13,14,15]. Hadsell et al. [12] used a self-supervised NN in conjunction with a stereo module to classify the terrain in front of the robot as ground or obstacle. Giusti et al. [13] and Smolyanskiy et al. [14] used a DNN as a supervised classifier to output the heading direction of a trail relative to the viewing direction of a quadrotor. Both of these works predict the instantaneous heading direction of the trail; the trail itself is not detected and localized. Adhikari et al. [15] used a segment-then-detect approach for trail detection: a patch-based DNN was used to segment the trail, and dynamic programming was used as a post-processing step to detect and localize the trail line. However, due to the ambiguity between trail and non-trail patches, the accuracy of the patch-based segmentation is low, which affects the accuracy of the detected trail.

Fig. 1. The proposed method of learning semantic lines for forest trail detection using convolutional neural networks.

In this work we propose an end-to-end learning system for forest trail detection using semantic lines, as shown in Fig. 1. Unlike the segment-then-detect approach, a semantic line is used to annotate the trail, and a fully convolutional neural network is used to learn the semantic lines. Moreover, we propose a novel distance-weighted loss function for training the fully convolutional neural network. The proposed loss function focuses the attention of the network on the trail by penalizing low activations around the ground truth and high activations in areas farther away from the trail. The proposed loss function is shown to produce more accurate forest trails than other commonly used loss functions.

2 Semantic Line for Trail Detection

One of the factors enabling the rapid adoption and increased performance of deep neural networks is the availability of huge amounts of data for training. However, for supervised training of DNNs, this data has to be annotated manually with ground truth, and preparing large-scale ground-truth annotations is expensive and time consuming. Manual annotation is particularly time consuming for semantic segmentation, where per-pixel annotation is required. For tasks like forest trail detection, the cluttered background with amorphous “stuff” and ambiguous boundaries without lane markings make it difficult even for humans to annotate correctly.

Fig. 2. Annotating a forest trail with a bounding box is impractical, and dense annotation for trail segmentation is difficult as the trail does not have a defined shape or appearance and at times merges seamlessly with the surrounding environment. However, it is simple and meaningful to annotate with a semantic line representing the concept of a forest trail.

However, humans are good at understanding visual scenes and at communicating ideas about a scene efficiently using graphical representations. For example, one meaningful way to represent a “navigable” path on a wide forest trail is to draw a line along the “shortest obstacle-free” path, as shown in Fig. 2. The trail in the image is not marked, has no defined shape or appearance, and its boundaries are ambiguous as the trail merges seamlessly with the surrounding environment. Unlike the per-pixel annotations required for semantic segmentation, the trail is annotated by drawing a line a few pixels thick along the length of the visible trail. Such a simple line, henceforth referred to as a semantic line, can also encode higher-level concepts about obstacle avoidance and path planning required for navigating forest trails. This graphical representation of the trail line is intuitive and can be annotated easily even in complex natural environments.

Structured road lanes have also been annotated with lines [17] for lane detection using deep learning, and these can be considered semantic lines. However, there is a fundamental difference between semantic lines representing road lanes and those representing natural trails. Urban road lanes are clearly guided by distinct lane markings present on the road and are not subjective. No such guiding marks are available for natural trails, and the annotated trail is often subjective [15], making natural trails more challenging. In this work we use the IDSIA forest trail dataset [16] to demonstrate the effectiveness of the proposed approach.

3 Distance Weighted Loss Function

The loss functions commonly used for learning DNNs with 2D outputs, such as semantic segmentation, are the categorical cross-entropy loss (CCE), mean squared error (MSE) and Jaccard coefficient (JC). To mitigate the effect of class imbalance in the training data, these losses are generally weighted using frequency-based techniques. The weights assigned to the different classes are inversely proportional to their frequency of occurrence in the training set. Hence, the dominant classes are assigned lower weights than rarely occurring classes so as to give equal importance to all the classes during training. Though effective in mitigating the class-imbalance problem, the high weights assigned to rare labels produce noisy gradients and may lead to unstable optimization.
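As a reference point, the sketch below illustrates such frequency-based weighting for a binary trail/background annotation; the function name and the normalization are our own choices, not taken from the paper.

```python
import numpy as np

def inverse_frequency_weights(label_map, num_classes=2):
    """Illustrative sketch: class weights inversely proportional to class frequency."""
    # Count how often each class appears in the integer-coded annotation.
    counts = np.bincount(label_map.ravel(), minlength=num_classes).astype(np.float64)
    freq = counts / counts.sum()
    # Rare classes receive large weights, dominant classes small ones.
    weights = 1.0 / np.maximum(freq, 1e-12)
    return weights / weights.sum()
```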

The semantic line annotation used in this study results in highly imbalanced classes with a strong bias towards the background class. To mitigate the effect of this imbalanced data, we propose a novel loss function given by

$$\begin{aligned} L_{DW} = \frac{1}{N \times M}\sum \limits _{j = 1}^{N} \sum \limits _{i = 1}^{M} \left\{ (1 - g_{ij}) \cdot d(i,j) \cdot p_{ij} - g_{ij} \cdot d_{\max } \cdot \log (p_{ij}) \right\} \end{aligned}$$
(1)

where \({g_{ij}} \in \{ 0,1\} \) is the ground-truth label (background = 0, trail = 1), M \(\times \) N is the spatial dimension of the target, \({p_{ij}} \in [0,1]\) is the result of the sigmoid non-linearity applied to the output layer, \(d(i,j)\) is the value of the distance map at location (i, j), and \({d_{\max }}\) is the maximum value in the map. Instead of frequency-based weighting, the proposed loss function, Eq. (1), uses a distance-map-based weighting. The distance map is generated by computing the distance transform of the ground-truth annotation with the semantic line as foreground. Pixels far away from the semantic line are weighted more than those in the proximity of the line.
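A minimal sketch of how such a distance map can be computed with SciPy's Euclidean distance transform is shown below, assuming a binary H × W mask with 1 on the annotated line; the authors do not specify their exact implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_weight_map(line_mask):
    """Distance of each pixel to the nearest semantic-line pixel (0 on the line)."""
    # distance_transform_edt returns, for every non-zero pixel, the Euclidean
    # distance to the nearest zero pixel, so inverting the line mask yields the
    # distance of background pixels to the annotated trail line.
    d = distance_transform_edt(1 - line_mask)
    return d, d.max()
```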

The first term in Eq. (1) affects only the background pixels, whereas the second term affects only the foreground pixels. In the first term we multiply \({p_{ij}}\) by \(d(i,j)\) to penalize high activations in background areas; the term effectively penalizes predicted trail pixels that should not be there, with faraway background pixels penalized more than nearby ones. The second term is the log-likelihood of the foreground pixels; it penalizes low activations and encourages high activations near the ground truth. This term is weighted by \({d_{\max }}\) to bring both parts of the loss function to similar scales.
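The following TensorFlow sketch implements Eq. (1) under the assumption of a single-channel logit output and a precomputed distance map; the small epsilon in the log term and all variable names are our own additions.

```python
import tensorflow as tf

def distance_weighted_loss(logits, gt, dist_map, d_max):
    """Sketch of Eq. (1). Tensors are N x H x W x 1; names are illustrative."""
    p = tf.sigmoid(logits)                            # p_ij in [0, 1]
    eps = 1e-7                                        # numerical safety for the log term
    background = (1.0 - gt) * dist_map * p            # penalize activations far from the line
    foreground = -gt * d_max * tf.math.log(p + eps)   # penalize low activations on the line
    # Average over the spatial dimensions (and the batch).
    return tf.reduce_mean(background + foreground)
```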

4 Learning Semantic Lines Using Encoder-Decoder Network

Convolutional encoder-decoder networks have been used successfully for various computer vision tasks that require a 2D output [3,4,5]. The encoder-decoder architecture is adopted in this study for learning forest trails using semantic lines, as shown in Fig. 1. A U-Net [18] like architecture is used, and the details of the network are given in Table 1.

Table 1. The architecture of the encoder-decoder network for detecting forest trails using semantic lines.

Fig. 3. (a) VGG-style block used in the encoder; (b) ResNet-style block used in the decoder of the network.

The encoder consists of VGG [19] style blocks in which three consecutive 3 \(\times \) 3 convolutions are followed by 2 \(\times \) 2 max pooling to successively reduce the spatial resolution of the feature maps. In the VGG block, the feature maps after each convolution are batch normalized and non-linearly transformed using the ReLU activation, as shown in Fig. 3(a). The decoder follows the encoder in U-Net like fashion. Each block in the decoder consists of up-sampling using transposed convolution followed by a ResNet [20] like de-convolution (transposed convolution) block, as shown in Fig. 3(b). The output is then concatenated with the corresponding feature maps (before max-pooling) from the encoder. The final layer is a convolutional layer that maps its input to a single-channel output. The sigmoid non-linearity, Eq. (2), is then applied to output a saliency map indicating the confidence of each pixel belonging to the trail line. The network is then trained from scratch using the proposed loss function of Eq. (1). A sketch of these encoder and decoder blocks is given after Eq. (2).

$$\begin{aligned} {\tilde{y}_{ij}} = p({x_{ij}}) = \frac{1}{{(1 + \exp ( - {x_{ij}}))}} \end{aligned}$$
(2)
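The sketch below shows one encoder stage and one decoder stage in tf.keras under the description above; the exact filter counts and the internal ordering of the ResNet-style block follow Table 1 and Fig. 3 and may differ from this illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def vgg_block(x, filters):
    """Encoder block of Fig. 3(a): three 3x3 convs with BN + ReLU, then 2x2 max pooling."""
    for _ in range(3):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    skip = x                              # feature maps kept for the U-Net skip connection
    return layers.MaxPooling2D(2)(x), skip

def decoder_block(x, skip, filters):
    """Decoder block of Fig. 3(b): transposed-conv upsampling, a ResNet-style
    transposed-conv block, then concatenation with the encoder features."""
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    shortcut = x
    y = layers.Conv2DTranspose(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2DTranspose(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    x = layers.ReLU()(layers.Add()([shortcut, y]))
    return layers.Concatenate()([x, skip])
```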

5 Experiments and Results

We evaluate the proposed method on a subset of the IDSIA forest trail dataset. The dataset contains images of a natural forest trail captured by three head-mounted cameras oriented in different directions. Image-level annotations for the instantaneous heading direction of the trail are implicit in this dataset. However, our target is not to find the instantaneous heading direction but to detect the local segment of the trail visible in the image. Only the images captured by the front-facing camera of folders 001 and 002 were used, as the forest trail was visible only from the front-facing camera and the images of folders 001 and 002 were captured under identical settings. A total of 3835 images were extracted, the trail was annotated with semantic lines, and the corresponding distance maps were computed. 2860 images were used for training, 472 for validation and 503 were set aside for testing. The training and test images were taken from different sections of the trail without overlap. The high-resolution images were down-sampled to 232 \(\times \) 360 pixels to reduce computation and memory requirements. The training data was augmented with vertical mirrors, and random crops of size 192 \(\times \) 320 were generated at runtime.
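A minimal sketch of the runtime augmentation is given below, assuming a 3-channel image and a 1-channel line mask of the same dtype; “vertical mirror” is interpreted here as a flip about the vertical axis, which may differ from the authors' pipeline.

```python
import tensorflow as tf

def augment(image, line_mask, crop_h=192, crop_w=320):
    """Runtime augmentation sketch: random 192x320 crop and a mirror about the
    vertical axis, applied jointly to the image and the semantic-line label."""
    stacked = tf.concat([image, line_mask], axis=-1)          # 232 x 360 x 4
    stacked = tf.image.random_crop(stacked, [crop_h, crop_w, 4])
    stacked = tf.image.random_flip_left_right(stacked)        # flip image and label together
    return stacked[..., :3], stacked[..., 3:]
```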

The network was trained from scratch with the parameters initialized using Xavier initialization. We trained the network with a batch size of 12 using the Adam optimizer with an initial learning rate of \(10^{-4}\) for a total of 40 epochs; the learning rate was halved after 20 epochs. At the end of each epoch the model was evaluated on the validation set using Eq. (1), and the model with the lowest validation error was selected as the final model for testing. The experiments were carried out using TensorFlow on a machine with an NVIDIA Titan X GPU.
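For illustration, a training-loop sketch matching the reported settings is shown below; the dataset elements are assumed to be (image, ground truth, distance map, d_max) tuples, the loss function is the sketch of Eq. (1) given earlier, and all names are our own.

```python
import tensorflow as tf

def train(model, train_ds, val_ds, loss_fn, epochs=40, batch_size=12):
    """Training-loop sketch: Adam at 1e-4, halved after 20 epochs, best model on
    validation loss kept as the final model."""
    opt = tf.keras.optimizers.Adam(learning_rate=1e-4)
    best_val, best_weights = float("inf"), None
    for epoch in range(epochs):
        if epoch == 20:                                   # halve the learning rate after 20 epochs
            opt.learning_rate.assign(5e-5)
        for x, g, d, d_max in train_ds.batch(batch_size):
            with tf.GradientTape() as tape:
                loss = loss_fn(model(x, training=True), g, d, d_max)
            grads = tape.gradient(loss, model.trainable_variables)
            opt.apply_gradients(zip(grads, model.trainable_variables))
        # Select the model with the lowest validation loss, Eq. (1).
        val_loss = float(tf.reduce_mean(
            [loss_fn(model(x, training=False), g, d, d_max)
             for x, g, d, d_max in val_ds.batch(batch_size)]))
        if val_loss < best_val:
            best_val, best_weights = val_loss, model.get_weights()
    model.set_weights(best_weights)
    return model
```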

The mean intersection over union (mIoU) given by

$$\begin{aligned} IoU = \frac{\left| {G \cap P} \right| }{\left| {G \cup P} \right| } \end{aligned}$$
(3)

is chosen as the evaluation metric on the test set, where G is the ground-truth mask and P is the binary prediction mask obtained after thresholding (\(\tau = 0.5\)) the output of the network. The performance of the proposed loss is compared to the commonly used CCE, MSE and JC. The target labels are one-hot encoded to compute CCE, MSE and JC; hence the number of output channels is two for these losses, and softmax is applied to these feature maps to scale the output. To reduce the effect of class imbalance in the training data, weights based on the inverse of the class frequency are used for CCE and MSE. We report the performance obtained by training the network five times with different random seeds. The final result is computed as the average of the predictions on the original image and its mirror image. The mIoU performance of the different losses is presented in Table 2, from which we see that the proposed method leads to more accurate trails than CCE, JC and MSE. Unlike the other loss functions, whose final test accuracy varies widely with the initialization, the proposed loss function leads to consistent results irrespective of the initialization.
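A sketch of the per-image IoU computation under these definitions is given below; the mirror-image averaging and the averaging over the five training runs are omitted, and the names are illustrative.

```python
import numpy as np

def iou(pred_prob, gt_mask, tau=0.5):
    """IoU of Eq. (3) after thresholding the network's saliency map at tau."""
    pred = pred_prob > tau
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                      # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union
```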

Table 2. Performance of the proposed distance weighted loss function on the test set.

In Fig. 4 we also plot the empirical cumulative distribution of the mIoU for a set of models trained under similar settings using the different loss functions. From Fig. 4 we can see that under a weak evaluation setting (the trail is said to be positively detected if mIoU > 0.3), CCE performs better than MSE and JC. However, under a strong evaluation setting (mIoU > 0.5), the performance of CCE falls well below that of MSE and JC. The proposed loss function shows superior performance compared to all the other losses under both the weak and strong evaluation settings.

Fig. 4. Empirical cumulative distribution function of mIoU on the test set obtained by models trained using different loss functions (cce: cross-entropy, jc: Jaccard coefficient, mse: mean squared error, dw: distance-weighted (proposed) loss).

Fig. 5. Qualitative results of trail detection using the proposed method. The test image, the image with the ground truth overlaid, and the image with the prediction overlaid are presented in the first, second and third columns, respectively. (a–e) Results where the mIoU > 0.6; (f–j) failure cases where the mIoU < 0.3.

Some qualitative results of end-to-end trail detection using the proposed method are presented in Fig. 5. Examples where the mIoU between the ground truth and the predicted trail is greater than 60% are shown in Fig. 5(a)–(e), and failure cases where the mIoU is less than 30% are presented in Fig. 5(f)–(j). We observe that the network outputs thicker trail lines than the ground truth. This is partly due to the low distance-based penalty assigned to background pixels in the vicinity of the ground truth and partly due to the architecture of the decoder. A simple post-processing step such as erosion could be used to refine the output; however, no post-processing was applied in the experiments.

In the absence of clear guiding landmarks or boundaries, the process of annotating trails with semantic lines is subjective. Though we did not employ multiple annotators to estimate the effect of human subjectivity, it can be observed from Fig. 5(h), (i) and (j) that human subjectivity can have a significant effect on the accuracy of the trained model. Though the trained model outputs valid “navigable” trails for these images, their mIoU with the ground truth is less than 30%, rendering them poor detections. While preparing the training data, the trails were annotated with a single line; even trails bifurcating into multiple paths were annotated with a single line representing the dominant path. From Fig. 5(j) it can be observed that such trail bifurcations have to be handled in an appropriate manner.

6 Conclusion

A novel distance-weighted loss function for end-to-end learning of unstructured forest trails using a convolutional encoder-decoder network with semantic lines was proposed. Unlike the pixel-wise labelling used in semantic segmentation, the forest trail was annotated with a “semantic line”. A distance map derived from the ground truth was then used to weight the loss function and guide the focus of the network onto the forest trail. On the IDSIA forest trail dataset, the proposed loss function produced more accurate and consistent trail predictions than other commonly used loss functions such as the categorical cross-entropy, mean squared error and Jaccard coefficient. Future work will include the development of techniques to output accurate and thinner trails, and the evaluation of the proposed method on other applications and datasets.