ABSTRACT
3D localization of objects in road scenes is important for autonomous driving and advanced driver-assistance systems (ADAS). However, 3D information is difficult to obtain with common monocular camera setups. In this paper, we propose a novel and robust method for 3D localization of monocular visual objects in road scenes that jointly integrates depth estimation, ground plane estimation, and multi-object tracking. First, we propose an object depth estimation method with depth confidence that uses the monocular depth map produced by a CNN. Second, we propose an adaptive ground plane estimation method that uses both dense and sparse features to localize objects when their depth estimates are unreliable. Third, we take temporal information into account through a new object tracklet smoothing method. Unlike most existing methods, which consider only vehicle localization, our method is applicable to the common moving objects in road scenes, including pedestrians, vehicles, and cyclists. Moreover, the input depth map can be replaced by equivalent depth information from other sensors, such as LiDAR, depth cameras, and radar, which makes our system much more competitive than other object localization methods. Evaluated on the KITTI dataset, our method achieves favorable 3D localization performance for both pedestrians and vehicles compared with state-of-the-art vehicle localization methods, although, to the best of our knowledge, no published results on pedestrian 3D localization are available for comparison.
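The fallback described above, using the estimated depth when it is confident and the ground plane otherwise, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pinhole camera with axes x right, y down, z forward, a ground plane given by a unit normal and the camera height above it, and a hypothetical confidence threshold `min_conf`. All function and parameter names are illustrative.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed pinhole model


def pixel_ray(u: float, v: float, intr: Intrinsics) -> Vec3:
    """Direction of the viewing ray through pixel (u, v), scaled so that z = 1."""
    fx, fy, cx, cy = intr
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def from_depth(u: float, v: float, depth: float, intr: Intrinsics) -> Vec3:
    """Back-project a pixel to 3D using its depth along the optical axis."""
    dx, dy, dz = pixel_ray(u, v, intr)
    return (dx * depth, dy * depth, dz * depth)


def from_ground_plane(u: float, v: float, normal: Vec3, height: float,
                      intr: Intrinsics) -> Optional[Vec3]:
    """Intersect the viewing ray with the plane normal . X = height.

    Returns None when the ray is parallel to the plane or points away from it
    (for example, a pixel above the horizon).
    """
    d = pixel_ray(u, v, intr)
    denom = normal[0] * d[0] + normal[1] * d[1] + normal[2] * d[2]
    if denom <= 1e-9:
        return None
    t = height / denom
    return (d[0] * t, d[1] * t, d[2] * t)


def localize(foot_uv: Tuple[float, float], depth: Optional[float], confidence: float,
             normal: Vec3, height: float, intr: Intrinsics,
             min_conf: float = 0.5) -> Optional[Vec3]:
    """Use the depth estimate if it is confident, otherwise the ground plane.

    foot_uv is the pixel where the object touches the ground (assumed here to be
    the bottom-center of its bounding box); min_conf is a hypothetical threshold.
    """
    u, v = foot_uv
    if depth is not None and confidence >= min_conf:
        return from_depth(u, v, depth, intr)
    return from_ground_plane(u, v, normal, height, intr)
```

For example, with intrinsics `(1000, 1000, 500, 200)`, a flat ground normal `(0, 1, 0)`, and a camera 1.5 m above the road, a low-confidence object whose foot point is at pixel `(500, 300)` is placed about 15 m ahead of the camera by the ground plane branch.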
Index Terms
- Monocular Visual Object 3D Localization in Road Scenes