DOI: 10.1145/3343031.3350924
research-article

Monocular Visual Object 3D Localization in Road Scenes

Published: 15 October 2019

ABSTRACT

3D localization of objects in road scenes is important for autonomous driving and advanced driver-assistance systems (ADAS). However, 3D information is difficult to obtain with common monocular camera setups. In this paper, we propose a novel and robust method for 3D localization of monocular visual objects in road scenes that jointly integrates depth estimation, ground plane estimation, and multi-object tracking. First, we propose an object depth estimation method with a depth confidence measure, which uses the monocular depth map produced by a CNN. Second, we propose an adaptive ground plane estimation method that uses both dense and sparse features to localize objects when their depth estimates are unreliable. Third, we take temporal information into account through a new object tracklet smoothing method. Unlike most existing methods, which consider only vehicle localization, our method applies to common moving objects in road scenes, including pedestrians, vehicles, and cyclists. Moreover, the input depth map can be replaced by equivalent depth information from other sensors, such as LiDAR, depth cameras, and radar, which makes our system much more competitive than other object localization methods. Evaluated on the KITTI dataset, our method achieves favorable performance on 3D localization of both pedestrians and vehicles compared with state-of-the-art vehicle localization methods; to the best of our knowledge, no published results on pedestrian 3D localization are available for comparison.
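The fallback described in the abstract (use the estimated depth when it is reliable, otherwise rely on the ground plane) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes a pinhole camera with axes x right, y down, z forward, a ground plane written as normal . X + offset = 0 in camera coordinates, and a simple confidence threshold. All function names, the `min_confidence` parameter, and the numeric values are hypothetical.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def pixel_ray(u: float, v: float, fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Direction of the camera ray through pixel (u, v), scaled so that z = 1."""
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def locate_by_depth(ray: Vec3, depth: float) -> Vec3:
    """Scale the z = 1 ray by a depth measured along the optical axis."""
    return (ray[0] * depth, ray[1] * depth, ray[2] * depth)


def locate_by_ground_plane(ray: Vec3, normal: Vec3, offset: float) -> Vec3:
    """Intersect the ray t * ray with the plane normal . X + offset = 0."""
    denom = sum(n * r for n, r in zip(normal, ray))
    if abs(denom) < 1e-9:
        raise ValueError("ray is parallel to the ground plane")
    t = -offset / denom
    if t <= 0:
        raise ValueError("intersection lies behind the camera")
    return (ray[0] * t, ray[1] * t, ray[2] * t)


def localize(u, v, depth, confidence, intrinsics, plane, min_confidence=0.5) -> Vec3:
    """Pick the depth cue when it is trusted, else the ground-plane cue.

    (u, v) is the pixel where the object touches the ground, for example
    the bottom centre of its bounding box (an assumption of this sketch).
    """
    fx, fy, cx, cy = intrinsics
    ray = pixel_ray(u, v, fx, fy, cx, cy)
    if confidence >= min_confidence:
        return locate_by_depth(ray, depth)
    normal, offset = plane
    return locate_by_ground_plane(ray, normal, offset)


if __name__ == "__main__":
    intrinsics = (700.0, 700.0, 600.0, 180.0)   # hypothetical fx, fy, cx, cy
    plane = ((0.0, 1.0, 0.0), -1.65)            # flat ground 1.65 m below the camera
    # Low confidence: the position comes from the ground-plane intersection.
    print(localize(670.0, 250.0, 20.0, 0.1, intrinsics, plane))
    # High confidence: the position comes from the estimated depth.
    print(localize(670.0, 250.0, 20.0, 0.9, intrinsics, plane))
```

In this sketch the two cues are mutually exclusive. A hard threshold is only the simplest choice; weighting the two estimates by confidence would be another option, and the abstract does not say which the authors use.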

          Published in

            MM '19: Proceedings of the 27th ACM International Conference on Multimedia
            October 2019
            2794 pages
            ISBN:9781450368896
            DOI:10.1145/3343031

            Copyright © 2019 ACM

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Acceptance Rates

            MM '19 paper acceptance rate: 252 of 936 submissions, 27%
            Overall acceptance rate: 995 of 4,171 submissions, 24%
