Abstract
Semantic scene understanding is a useful capability for autonomous vehicles operating in off-road environments. While cameras are the most common sensor used for semantic classification, the performance of camera-based methods may suffer when there is significant variation between the training and testing sets caused by illumination, weather, and seasonal changes. On the other hand, 3D information from active sensors such as LiDAR is comparatively invariant to these factors, which motivates us to investigate whether it can be used to improve performance in this scenario. In this paper, we propose a novel multimodal Convolutional Neural Network (CNN) architecture consisting of two streams, 2D and 3D, which are fused by projecting 3D features to image space to achieve robust pixelwise semantic segmentation. We evaluate our proposed method on a novel off-road terrain classification benchmark and show a 25% improvement in mean Intersection over Union (IoU) of navigation-related semantic classes, relative to an image-only baseline.
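As an illustration of the evaluation metric named above, the following minimal sketch computes per-class IoU, averages it over a chosen subset of classes, and expresses a gain relative to a baseline. It is not the authors' evaluation code: the function names, the flat label lists, and the choice to skip classes that never occur are all assumptions made for this example, and reading the reported figure as a relative gain is likewise an assumption.

```python
from typing import Dict, Hashable, Sequence


def per_class_iou(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    classes: Sequence[Hashable],
) -> Dict[Hashable, float]:
    """IoU for each class in `classes`, computed over flattened pixel labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    ious: Dict[Hashable, float] = {}
    for c in classes:
        intersection = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        # Assumption: a class absent from both label sets is skipped
        # rather than counted as IoU 0 or 1.
        if union > 0:
            ious[c] = intersection / union
    return ious


def mean_iou(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    classes: Sequence[Hashable],
) -> float:
    """Mean of the per-class IoU values over the classes that occur."""
    ious = per_class_iou(y_true, y_pred, classes)
    if not ious:
        raise ValueError("none of the requested classes occur in the labels")
    return sum(ious.values()) / len(ious)


def relative_improvement(baseline: float, improved: float) -> float:
    """Gain of `improved` over `baseline`, as a fraction of the baseline."""
    return (improved - baseline) / baseline
```

Passing only the navigation-related class identifiers as `classes` restricts the average to those classes, which mirrors how the abstract scopes the metric.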
Notes
1. For performance reasons, we simplify the point cloud network by replacing the dilation layer and the asymmetric layer with regular convolution layers. We also replace each deconvolution layer with an upsampling layer followed by a \(3 \times 3 \times 3\) convolutional layer with stride 1. For simplicity, we still refer to this combination as “deconvolution”.
2. The point cloud is represented as a 3D voxel grid because a convolutional architecture requires a regular input format.
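The two notes above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the binary occupancy encoding, the grid origin and voxel size parameters, nearest-neighbour upsampling, the upsampling factor of 2, and the convolution padding of 1 are all choices made for this example and are not given in the text.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Grid = List[List[List[int]]]


def voxelize(
    points: Sequence[Point],
    origin: Point,
    voxel_size: float,
    dims: Tuple[int, int, int],
) -> Grid:
    """Map an unordered point cloud to a dense grid[x][y][z] of 0/1 values.

    Assumption: binary occupancy; points outside the grid are dropped.
    """
    nx, ny, nz = dims
    grid = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
    for point in points:
        i, j, k = (math.floor((c - o) / voxel_size) for c, o in zip(point, origin))
        if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
            grid[i][j][k] = 1
    return grid


def upsample_nearest(grid: Grid, factor: int = 2) -> Grid:
    """Nearest-neighbour upsampling: each voxel is repeated `factor` times per axis."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    return [
        [
            [grid[i // factor][j // factor][k // factor] for k in range(nz * factor)]
            for j in range(ny * factor)
        ]
        for i in range(nx * factor)
    ]


def conv_out_size(n: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Output length along one axis of a standard convolution."""
    return (n + 2 * padding - kernel) // stride + 1
```

With a kernel of 3, stride 1, and the assumed padding of 1, `conv_out_size` returns its input unchanged, so the spatial enlargement in this substitute for deconvolution comes entirely from the upsampling step while the convolution only refines the features.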
Acknowledgements
We thank the Yamaha Motor Corporation for supporting this research.
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Kim, DK., Maturana, D., Uenoyama, M., Scherer, S. (2018). Season-Invariant Semantic Segmentation with a Deep Multimodal Network. In: Hutter, M., Siegwart, R. (eds) Field and Service Robotics. Springer Proceedings in Advanced Robotics, vol 5. Springer, Cham. https://doi.org/10.1007/978-3-319-67361-5_17
DOI: https://doi.org/10.1007/978-3-319-67361-5_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67360-8
Online ISBN: 978-3-319-67361-5
eBook Packages: Engineering, Engineering (R0)