DOI: 10.1145/3444685.3446257
research-article

Incremental multi-view object detection from a moving camera

Published: 03 May 2021

Abstract

Object detection in a single image is challenging because of clutter, occlusions, and the wide range of possible viewpoints. The task can benefit from integrating multi-frame information captured by a moving camera. In this paper, we propose a method that incrementally accumulates object detection scores extracted from multiple frames captured from different viewpoints. For each frame, we run an efficient end-to-end object detector that outputs object bounding boxes, each associated with category and pose scores. The scores of detected objects are then stored in grid locations in 3D space. After multiple frames have been observed, the object scores stored in each grid location are integrated based on the best object pose hypothesis. This strategy requires object categories and poses to be consistent across frames, and thus it significantly suppresses misdetections. The performance of the proposed method is evaluated on our newly created multi-class object dataset, captured in robot simulation and in real environments, as well as on a public benchmark dataset.
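
The abstract does not specify the grid data structure or the exact integration rule, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes that pose is discretised into a fixed number of bins, that the camera's rotation around the object is known per frame as a whole number of bins, and that scores are combined by summing log-scores per grid cell. The names `ScoreGrid`, `add`, `best`, `view_shift`, and `cell_size` are hypothetical and do not come from the paper, and the direction of the pose-bin shift is an arbitrary convention.

```python
from collections import defaultdict

import numpy as np


class ScoreGrid:
    """Hypothetical 3D grid that accumulates per-cell category/pose scores."""

    def __init__(self, num_classes, num_poses, cell_size=0.1):
        self.shape = (num_classes, num_poses)
        self.cell_size = cell_size  # edge length of a grid cell (assumed metres)
        # Each cell holds summed log-scores indexed by [category, world pose bin].
        self.cells = defaultdict(lambda: np.zeros(self.shape))

    def _key(self, xyz):
        # Quantise a 3D position to integer cell indices.
        idx = np.floor(np.asarray(xyz, dtype=float) / self.cell_size)
        return tuple(int(v) for v in idx)

    def add(self, xyz, scores, view_shift):
        """Store one detection.

        scores: (num_classes, num_poses) array of detector scores in (0, 1],
                with pose bins expressed relative to the current camera view.
        view_shift: number of pose bins the camera has rotated around the
                object (assumed known, e.g. from odometry).
        """
        scores = np.asarray(scores, dtype=float)
        if scores.shape != self.shape:
            raise ValueError("scores must have shape (num_classes, num_poses)")
        # Re-index camera-relative pose bins into a shared world frame.
        aligned = np.roll(scores, view_shift, axis=1)
        # Summing logs rewards hypotheses that stay consistent across frames.
        self.cells[self._key(xyz)] += np.log(np.clip(aligned, 1e-6, 1.0))

    def best(self, xyz):
        """Return (category, world pose bin, summed log-score) or None."""
        key = self._key(xyz)
        if key not in self.cells:
            return None
        acc = self.cells[key]
        c, p = np.unravel_index(np.argmax(acc), acc.shape)
        return int(c), int(p), float(acc[c, p])
```

In this sketch, a category that scores highly in one frame but at a pose that disagrees with the other frames ends up with a low summed log-score, which is one simple way to express the cross-frame consistency requirement described in the abstract.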

Published In

MMAsia '20: Proceedings of the 2nd ACM International Conference on Multimedia in Asia
March 2021
512 pages
ISBN: 9781450383080
DOI: 10.1145/3444685
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. 3D object recognition
  2. neural networks
  3. object detection

Qualifiers

  • Research-article

Conference

MMAsia '20
MMAsia '20: ACM Multimedia Asia
March 7, 2021
Virtual Event, Singapore

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%

Article Metrics

  • Downloads (Last 12 months): 11
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 19 Feb 2025

Cited By

  • (2024) Open-Ended Online Learning for Autonomous Visual Perception. IEEE Transactions on Neural Networks and Learning Systems 35:8 (10178-10198). DOI: 10.1109/TNNLS.2023.3242448. Online publication date: Aug-2024
  • (2023) Semantic-Aware Dynamic Feature Selection and Fusion for Object Detection in UAV Videos. Proceedings of the 5th ACM International Conference on Multimedia in Asia (1-7). DOI: 10.1145/3595916.3626385. Online publication date: 6-Dec-2023
  • (2022) Classification of the Sidewalk Condition Using Self-Supervised Transfer Learning for Wheelchair Safety Driving. Sensors 22:1 (380). DOI: 10.3390/s22010380. Online publication date: 5-Jan-2022
  • (2022) Detecting More Objects Indoors: a Depth-guided Detector for Mobile Robots. 2022 IEEE International Conference on Robotics and Biomimetics (ROBIO) (904-908). DOI: 10.1109/ROBIO55434.2022.10011811. Online publication date: 5-Dec-2022
  • (2022) Multi-view aggregation for real-time accurate object detection of a moving camera. Journal of Real-Time Image Processing 19:6 (1169-1179). DOI: 10.1007/s11554-022-01253-9. Online publication date: 1-Dec-2022
