research-article

Monocular Visual Object 3D Localization in Road Scenes

Authors:
Yizhou Wang, University of Washington, Seattle, WA, USA
Yen-Ting Huang, National ChengChi University, Hsinchu, Taiwan ROC
Jenq-Neng Hwang, University of Washington, Seattle, WA, USA

MM '19: Proceedings of the 27th ACM International Conference on Multimedia, October 2019, Pages 917–925. https://doi.org/10.1145/3343031.3350924

Published: 15 October 2019

ABSTRACT

3D localization of objects in road scenes is important for autonomous driving and advanced driver-assistance systems (ADAS). However, 3D information is difficult to obtain with common monocular camera setups. In this paper, we propose a novel and robust method for 3D localization of monocular visual objects in road scenes that jointly integrates depth estimation, ground plane estimation, and multi-object tracking. First, we propose an object depth estimation method with depth confidence that uses the monocular depthmap from a CNN. Second, we propose an adaptive ground plane estimation that uses both dense and sparse features to localize objects when their depth estimates are not reliable. Third, we take temporal information into account through a new object tracklet smoothing method. Unlike most existing methods, which consider only vehicle localization, our method is applicable to common moving objects in road scenes, including pedestrians, vehicles, and cyclists. Moreover, the input depthmap can be replaced by equivalent depth information from other sensors, such as LiDAR, depth cameras, and radar, which makes our system more competitive than other object localization methods. Evaluated on the KITTI dataset, our method achieves favorable performance on 3D localization of both pedestrians and vehicles compared with state-of-the-art vehicle localization methods; to the best of our knowledge, no published results on pedestrian 3D localization are available for comparison.
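The three stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the pinhole back-projection, the fixed flat ground plane, the confidence threshold, and the moving-average smoother are simplifying assumptions standing in for the CNN depth confidence, the adaptive ground plane estimation, and the tracklet smoothing described in the paper. All function names and default values (`localize`, `min_confidence`, a camera height of 1.65 m) are hypothetical. The camera frame is assumed to have x pointing right, y pointing down, and z pointing forward, and the pixel is assumed to be the object's ground contact point.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pinhole back-projection of pixel (u, v) at a known depth."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def intersect_ground(u: float, v: float,
                     fx: float, fy: float, cx: float, cy: float,
                     normal: Vec3, height: float) -> Optional[Vec3]:
    """Intersect the viewing ray through (u, v) with the plane normal . X = height."""
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    denom = sum(n * r for n, r in zip(normal, ray))
    if denom <= 1e-9:
        # Ray is parallel to the plane or points away from it.
        return None
    t = height / denom
    return (ray[0] * t, ray[1] * t, ray[2] * t)


def localize(u: float, v: float, depth: float, confidence: float,
             fx: float, fy: float, cx: float, cy: float,
             normal: Vec3 = (0.0, 1.0, 0.0), height: float = 1.65,
             min_confidence: float = 0.5) -> Optional[Vec3]:
    """Use the depth estimate when it is trusted, else fall back to the ground plane.

    The threshold rule and the fixed plane are assumptions for illustration only.
    """
    if confidence >= min_confidence:
        return backproject(u, v, depth, fx, fy, cx, cy)
    return intersect_ground(u, v, fx, fy, cx, cy, normal, height)


def smooth_tracklet(points: Sequence[Vec3], window: int = 3) -> List[Vec3]:
    """Centered moving average over a tracklet of 3D positions (a stand-in smoother)."""
    half = window // 2
    smoothed: List[Vec3] = []
    for i in range(len(points)):
        lo, hi = max(0, i - half), min(len(points), i + half + 1)
        n = hi - lo
        smoothed.append((
            sum(p[0] for p in points[lo:hi]) / n,
            sum(p[1] for p in points[lo:hi]) / n,
            sum(p[2] for p in points[lo:hi]) / n,
        ))
    return smoothed
```

A caller would run `localize` once per detection per frame, collect the results for each tracked object, and pass that sequence to `smooth_tracklet`. Replacing the depth argument with a LiDAR or radar range measurement leaves the rest of the sketch unchanged, which mirrors the sensor substitution the abstract mentions.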

Index Terms

Monocular Visual Object 3D Localization in Road Scenes
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Image segmentation
        Object detection
        Tracking
      2. Image and video acquisition
        3D imaging

Published in
MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
General Chairs:
Laurent Amsaleg, CNRS-IRISA, France
Benoit Huet, EURECOM, France
Martha Larson, Radboud University and TU Delft, Netherlands
Program Chairs:
Guillaume Gravier, CNRS-IRISA, France
Hayley Hung, Delft University of Technology, Netherlands
Chong-Wah Ngo, City University of Hong Kong, Hong Kong
Wei Tsang Ooi, National University of Singapore, Singapore
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
autonomous driving
ground plane estimation
monocular depthmap
object localization
tracklet smoothing
Qualifiers
- research-article
Acceptance Rates
MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 995 of 4,171 submissions, 24%