Fusion of 3D-LIDAR and camera data for scene parsing

https://doi.org/10.1016/j.jvcir.2013.06.008

Highlights

  • A geometry segmentation algorithm is proposed to parse scanner pointclouds.

  • An efficient multilayer perceptron classifier is trained to parse camera images.

  • We propose a fuzzy logic based fusion method to integrate the results of the two sensors.

  • We propose a Markov random field based temporal fusion method.

  • The fused results are more reliable than those of individual sensors.

Abstract

Fusion of information gathered from multiple sources is essential to build a comprehensive situation picture for autonomous ground vehicles. In this paper, an approach is described that performs scene parsing and data fusion for a 3D-LIDAR scanner (Velodyne HDL-64E) and a video camera. First, a geometry segmentation algorithm is proposed to detect obstacles and ground areas in the data collected by the Velodyne scanner. Then, the corresponding image collected by the video camera is classified patch by patch into more detailed categories. After that, the parsing result of each frame is obtained by fusing the result from the Velodyne data with that from the image within a fuzzy logic inference framework. Finally, the parsing results of consecutive frames are smoothed by a Markov random field based temporal fusion method. The proposed approach has been evaluated on datasets collected by our autonomous ground vehicle testbed in both rural and urban areas. The fused results are more reliable than those acquired from the images or the Velodyne data alone.

Introduction

Autonomous situation awareness is an important research topic for robots and unmanned vehicles. Besides knowing whether the terrain is traversable, they also require more specific object category information to carry out their tasks, e.g., approaching a tree or a water area. For decades, computer vision approaches have been studied to classify scenes from images. Studies of the human visual system show that scene perception is a highly complex process of information fusion which involves not just the eyes, but also other senses including hearing, taste, etc. Even within the human vision system, there is clearly fusion of information from color, motion, depth and a whole variety of cues used to infer shape, movement and physical characteristics of the things within view [1]. In other words, efficient perceptual performance often requires integration of multiple sources of information, both within the senses and between them. As a matter of fact, other sensors such as the infrared laser projector in the Kinect [2] and LIDAR scanners [3] have been applied to complement video cameras in recent years.

In this work, in order to help unmanned vehicles understand their environment, two sensors are used: a Velodyne HDL-64E 3D-LIDAR scanner [3] and a monocular video camera. The Velodyne scanner provides a 3-dimensional but sparse pointcloud of the surrounding environment. The pointcloud is trustworthy for obstacle detection but lacks the color and texture information that is valuable for more detailed categorization of objects. Besides, although the Velodyne HDL-64E is a powerful LIDAR scanner, its effective coverage is limited to within 70 m of the sensor. Considering that some time is needed for information processing and task scheduling, a 70 m range may not be sufficient for a moving unmanned vehicle to respond. Furthermore, for some tasks we would like the vehicle to “see” as far as 200 m for advance planning. In contrast, images captured by video cameras can easily cover a much broader and farther area and provide more discriminative information for classifying objects into categories. However, due to the lack of depth information, image-based detection of obstacles of various shapes, sizes and orientations remains challenging. Given these complementary characteristics of cameras and LIDAR sensors, more reliable scene parsing can be achieved by fusing information derived from the two sensors.

In addition, sequential scene parsing requires fusing the results of consecutive frames. Even after the results of the two sensors are fused, the parsing results of consecutive frames may exhibit abrupt changes due to stochastic errors, and such abrupt changes may confuse the vehicle navigation system. Intuitively, more cohesive sequential parsing results can be obtained by adding temporal fusion.

In this research, we first propose a new way to fuse the results of the two sensors by employing fuzzy logic inference [4]. We then propose a Markov random field based approach to fuse the results of consecutive frames. Fig. 1 illustrates the fusion process. Fuzzy logic is preferable for our application for several reasons. First, fuzzy logic is built on the knowledge and experience of experts; it can therefore employ not only the results derived from the LIDAR and video camera data but also a priori knowledge. Second, fuzzy logic can model nonlinear functions of arbitrary complexity, which matters because scene parsing is not a trivial problem. Third, fuzzy logic can tolerate imprecise results from the two sensors. Moreover, fuzzy logic is a flexible fusion framework, so the results of additional sensors can easily be integrated into the system in the future.

To fuse the results of consecutive frames, we propose a Markov random field (MRF) based temporal fusion method [5], [6]. Correspondences between consecutive frames are first estimated using a dense optical flow method [7]. Then, an MRF model is built to integrate the results of multiple consecutive frames, and the result of each frame is refined by the Belief Propagation (BP) algorithm [8] (a simplified sketch of such a temporal chain is given after the contribution list below). This paper makes the following contributions:

  • 1.

    To the best of our knowledge, the proposed approach is the first systematic fuzzy logic inference based fusion framework for scene understanding that combines the results of a Velodyne 3D-LIDAR scanner and a monocular video camera.

  • 2.

    The MRF based temporal fusion method is introduced to obtain cohesive video parsing results. It smooths the whole frame simultaneously by integrating the results of multiple consecutive frames.

  • 3.

    We test the proposed approach on datasets collected by our autonomous ground vehicle testbed. The datasets were captured in urban and rural areas, in both daytime and nighttime. The results validate the robustness and effectiveness of our fusion approach for scene parsing.
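
As a concrete illustration of the temporal fusion idea mentioned above, the sketch below runs exact min-sum belief propagation on a per-pixel temporal chain with a Potts pairwise term. It assumes the per-frame class costs have already been warped into a common frame (e.g., by dense optical flow) and omits the exact formulation used in this paper.

```python
import numpy as np

def temporal_mrf_smooth(unary_costs, smoothness=1.0):
    """Min-sum belief propagation on a per-pixel temporal chain MRF.
    unary_costs: (T, H, W, L) per-frame, per-pixel, per-label costs, e.g.
    negative log class scores already aligned across frames by optical flow.
    A Potts pairwise term penalises label changes between consecutive frames."""
    T, H, W, L = unary_costs.shape
    fwd = np.zeros_like(unary_costs)   # messages passed forward in time
    bwd = np.zeros_like(unary_costs)   # messages passed backward in time
    for t in range(1, T):
        prev = fwd[t - 1] + unary_costs[t - 1]
        # Potts message: keep the previous label, or pay `smoothness` to switch.
        fwd[t] = np.minimum(prev, prev.min(axis=-1, keepdims=True) + smoothness)
    for t in range(T - 2, -1, -1):
        nxt = bwd[t + 1] + unary_costs[t + 1]
        bwd[t] = np.minimum(nxt, nxt.min(axis=-1, keepdims=True) + smoothness)
    beliefs = unary_costs + fwd + bwd
    return beliefs.argmin(axis=-1)     # (T, H, W) smoothed label maps

# Toy run: 5 frames, 4x4 pixels, 3 labels with random per-label scores.
costs = -np.log(np.random.dirichlet(np.ones(3), size=(5, 4, 4)))
labels = temporal_mrf_smooth(costs, smoothness=0.5)
```

Because the chain is a tree, this message passing is exact; the model in the paper integrates the whole frame and multiple frames jointly, which this sketch does not attempt to reproduce.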

A preliminary version of this paper was described in [9]. The current version differs from it in several ways, including the introduction of the MRF based temporal fusion method, a comprehensive evaluation of the method on three more datasets, further analysis and discussion of the whole approach, and a broader review of related work on sensor fusion and scene parsing. While the preliminary version [9] focuses on the fuzzy logic based fusion strategies, the current version also provides more details on the image parsing techniques.

This paper is organized as follows. In Section 2, we briefly survey the sensor fusion and scene parsing literature. We then present the parsing methods for the individual sensors, describe the fuzzy logic based method that fuses the results of the two sensors, and introduce the MRF based temporal fusion. Thorough experiments are reported for evaluation, followed by an in-depth discussion, and the final section concludes the paper.

Section snippets

Related work

By combining data from multiple sensors, we can achieve better accuracy and more specific inferences than by using a single sensor alone [10]. Existing methods for fusing LIDAR data and camera images can be grouped into two categories: centralized approaches and decentralized approaches. In centralized approaches, the fusion process occurs at the pixel level or feature level, i.e., features from both the LIDAR and the video camera are combined in a single vector for posterior

Parsing modules for individual sensors

As a decentralized fusion method, this work proposes a geometry segmentation algorithm to detect obstacles and ground in the Velodyne data. In parallel, an algorithm combining bottom-up and top-down analyses is designed to classify image patches into multiple categories. In this section, we first describe the two detection algorithms separately and then summarize their advantages and disadvantages.
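
For intuition, a minimal height-difference grid test of the kind commonly used to separate ground from obstacles in sparse pointclouds is sketched below. It is a generic illustration under simple assumptions (a roughly flat ground plane, a fixed cell size and height threshold), not the geometry segmentation algorithm proposed in this paper.

```python
import numpy as np

def segment_ground_obstacles(points, cell_size=0.5, height_thresh=0.3):
    """Label each 3D point as ground (0) or obstacle (1).
    points: (N, 3) array of x, y, z coordinates in the vehicle frame.
    Points are binned into a 2D grid on the ground plane; any cell whose
    points span a large height range is treated as containing an obstacle."""
    ij = np.floor(points[:, :2] / cell_size).astype(np.int64)
    _, cell_ids = np.unique(ij, axis=0, return_inverse=True)
    labels = np.zeros(len(points), dtype=np.uint8)
    for c in range(cell_ids.max() + 1):
        mask = cell_ids == c
        z = points[mask, 2]
        if z.max() - z.min() > height_thresh:
            labels[mask] = 1
    return labels

# Toy usage; a real run would take one Velodyne HDL-64E sweep as input.
labels = segment_ground_obstacles(np.random.randn(1000, 3))
```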

Fuzzy logic based sensor fusion

The results of the laser scanner and those of the camera image each have their own advantages and disadvantages. To parse the scene correctly, the primary task of the fusion is to categorize the candidate obstacles detected by the Velodyne scanner; the scene parsing results are then improved based on this categorization. As a good way to utilize a priori knowledge and the experience of human experts [4], we propose to use fuzzy inference to fuse the results of the two sensors.
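
As a rough illustration of how fuzzy rules can combine the two modules, the sketch below applies two hand-written rules to an obstacle confidence from the scanner and per-category scores from the image classifier. The membership functions, the category split, and the rules themselves are illustrative assumptions, not the rule base used in this work.

```python
def high(x):
    # Piecewise-linear membership in the fuzzy set "high confidence".
    return max(0.0, min(1.0, (x - 0.3) / 0.5))

def low(x):
    return 1.0 - high(x)

OBSTACLE_CLASSES = {"tree", "car", "building", "person"}   # illustrative split

def fuse(lidar_obstacle_conf, image_scores):
    fused = {}
    for category, score in image_scores.items():
        if category in OBSTACLE_CLASSES:
            # IF the scanner is confident the region is an obstacle AND the image
            # supports an obstacle category, THEN support that category.
            fused[category] = min(high(lidar_obstacle_conf), high(score))
        else:
            # A ground-like category (road, grass, ...) is only supported when
            # the scanner does NOT consider the region an obstacle.
            fused[category] = min(low(lidar_obstacle_conf), high(score))
    best = max(fused, key=fused.get)   # crude defuzzification: strongest wins
    return best, fused

label, memberships = fuse(0.85, {"tree": 0.7, "road": 0.6, "car": 0.2})
```

A priori knowledge, such as which image categories can correspond to scanner-detected obstacles, enters the system through rules of this form.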

Temporal fusion of consecutive frames

By fusing the results of the camera and the scanner, we obtain a better parsing result for each frame. The image parsing result helps the ground vehicle understand its environment. However, the results of consecutive frames may exhibit abrupt changes due to vehicle motion, partial occlusion, etc. Fig. 8 shows this phenomenon, with several incohesive regions marked in one frame by white circles. Such abrupt changes in the parsing results will mislead the vehicle navigation system. One major

Performance evaluation

To evaluate our fusion approach, we test it on four datasets collected by our autonomous ground vehicle testbed while driving in rural and urban areas, and on one public pedestrian dataset [54]. In the experiments, we compare the fusion result with the result obtained using the video camera alone. In addition, the MRF based temporal fusion method is evaluated further.
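
One common way to quantify the comparison between the fused result and the camera-only result is a per-class F-measure (the metric also mentioned below in the classifier selection). A minimal sketch is given here; the label maps and class count are placeholders, not data from our experiments.

```python
import numpy as np

def per_class_f_measure(pred, gt, num_classes):
    """Per-class F-measure between a predicted label map and the ground truth.
    pred, gt: integer label arrays of the same shape."""
    scores = {}
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        scores[c] = (2 * precision * recall / (precision + recall)
                     if precision + recall > 0 else 0.0)
    return scores

# Placeholder label maps standing in for camera-only and fused parsing results.
gt = np.random.randint(0, 4, (120, 160))
camera_only = np.random.randint(0, 4, (120, 160))
fused = np.where(np.random.rand(120, 160) < 0.5, gt, camera_only)
print(per_class_f_measure(camera_only, gt, num_classes=4))
print(per_class_f_measure(fused, gt, num_classes=4))
```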

Selection of image classifier

An MLP (multilayer perceptron) classifier was finally chosen to parse the image superpixels because of its lower computational cost compared with other classifiers such as the kernel support vector machine (SVM) or structured learning approaches such as the conditional random field (CRF) [32]. According to our experiments, the linear SVM does not work in our case. The non-linear SVM with an RBF kernel achieves an F-measure comparable to the MLP; however, it runs much slower than the MLP as the number of

Conclusions

In this paper, we present a sensor fusion method for scene parsing using a laser scanner and a video camera. By employing fuzzy logic inference, our method can incorporate not only the results of the two sensors, but also human experience and knowledge. To smooth the parsing results of consecutive frames, we further propose a Markov random field based temporal fusion method. The proposed approach has been evaluated on five datasets. Four of them are collected by our autonomous ground vehicle testbed in

Acknowledgments

This work is supported in part by Nanyang Assistant Professorship SUG M4080134, JSPS-NTU joint project M4080882, NTU CoE seed grant M4081039, and NTU-DSO joint project M4060969.

References (61)

  • L. Zadeh

    Fuzzy sets

    Inform. Control

    (1965)
  • E.H. Mamdani et al.

    An experiment in linguistic synthesis with a fuzzy logic controller

    Int. J. Hum. Comput. Stud.

    (1999)
  • D. Burr et al.

    Combining vision with audition and touch, in adults and in children

    Sens. Cue Integr.

    (2011)
  • Kinect, Microsoft, Inc., Dec. 2012,...
  • Velodyne LiDAR, Inc., HDL-64E, Dec. 2012,...
  • C. Liu et al.

    Nonparametric scene parsing via label transfer

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • J. Shotton et al.

    TextonBoost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context

    Int. J. Comput. Vision

    (2009)
  • T. Brox et al.

    High accuracy optical flow estimation based on a theory for warping

    Comput. Vision-ECCV

    (2004)
  • P.F. Felzenszwalb et al.

    Efficient belief propagation for early vision

    Int. J. Comput. Vision

    (2006)
  • G. Zhao, X. Xiao, J. Yuan, Fusion of Velodyne and camera data for scene parsing, in: 15th International Conference on...
  • D. Hall et al.

    An introduction to multisensor data fusion

    Proc. IEEE

    (1997)
  • B. Douillard, A. Brooks, F. Ramos, A 3D laser and vision based classifier, in: Proceedings of the Fifth International...
  • M. Häselich, M. Arends, D. Lang, D. Paulus, Terrain classification with Markov random fields on fused camera and 3D...
  • S. Laible, Y.N. Khan, K. Bohlmann, A. Zell, 3D LIDAR- and camera-based terrain classification under different lighting...
  • N. Kaempchen et al.

    Feature-level fusion for free-form object tracking using laserscanner and video

  • S. Schneider, M. Himmelsbach, T. Luettel, H.-J. Wünsche, Fusing vision and LIDAR – synchronization, correction and...
  • K. Kidono, T. Naito, J. Miura, Reliable pedestrian recognition combining high-definition LIDAR and vision data, in:...
  • M. Himmelsbach et al.

    Autonomous off-road navigation for MuCAR-3 – improving the tentacles approach: integral structures for sensing and motion

    KI

    (2011)
  • R. Labayrade et al.

    Cooperative fusion for multi-obstacles detection with use of stereovision and laser scanner

    Auton. Rob.

    (2005)
  • C. Premebida et al.

    LIDAR and vision-based pedestrian detection system

    J. Field Rob.

    (2009)
  • F. Garcia, D. Olmeda, Hybrid fusion scheme for pedestrian detection based on laser scanner and far infrared camera, in:...
  • W. Tang, K.Z. Mao, L.O. Mak, G.W. Ng, Z. Sun, J.H. Ang, G. Lim, Target classification using knowledge-based...
  • B.K. Habtemariam, R. Tharmarasa, T. Kirubarajan, D. Grimmett, C. Wakayama, Multiple detection probabilistic data...
  • S. Martin, Sequential bayesian inference models for multiple object classification, in: Proceedings of the 14th...
  • R. Matthaei, H. Dyckmanns, Motion classification for cross traffic in urban environments using laser and radar, in:...
  • D. Batra et al.

    Learning class-specific affinities for image labelling

  • C. Galleguillos et al.

    Object categorization using co-occurrence, location and appearance

  • L. Yang, P. Meer, D.J. Foran, Multiple class segmentation using a unified framework over mean-shift patches, in: IEEE...
  • A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, S. Belongie, Objects in context, in: ICCV, 2007, pp....
  • C. Pantofaru, C. Schmid, M. Hebert, Object recognition by integrating multiple image segmentations, in: ECCV (3), 2008,...