
Interactive 3D Annotation of Objects in Moving Videos from Sparse Multi-view Frames

Published: 01 November 2023

Abstract

Segmenting and determining the 3D bounding boxes of objects of interest in RGB videos is an important task for a variety of applications such as augmented reality, navigation, and robotics. Supervised machine learning techniques are commonly used for this, but they need training datasets: sets of images with associated 3D bounding boxes manually defined by human annotators using a labelling tool. However, precisely placing 3D bounding boxes can be difficult using conventional 3D manipulation tools on a 2D interface. To alleviate that burden, we propose a novel technique with which 3D bounding boxes can be created by simply drawing 2D bounding rectangles on multiple frames of a video sequence showing the object from different angles. The method uses reconstructed dense 3D point clouds from the video and computes tightly fitting 3D bounding boxes of desired objects selected by back-projecting the 2D rectangles. We show concrete application scenarios of our interface, including training dataset creation and editing 3D spaces and videos. An evaluation comparing our technique with a conventional 3D annotation tool shows that our method results in higher accuracy. We also confirm that the bounding boxes created with our interface have a lower variance, likely yielding more consistent labels and datasets.
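The selection step described above — back-projecting each drawn 2D rectangle into the reconstructed point cloud, keeping only points that fall inside every rectangle, and fitting a box to what remains — can be sketched as below. This is a minimal illustration under assumptions, not the paper's implementation: the function names (`select_points`, `fit_box`), the pinhole world-to-camera convention, and the axis-aligned box (the paper computes tightly fitting boxes, which may be oriented) are all choices made here for brevity.

```python
import numpy as np

def select_points(points, cameras, rects):
    """Keep 3D points whose projection lies inside every 2D rectangle.

    points  : (N, 3) reconstructed point cloud in world coordinates
    cameras : list of (K, R, t) per annotated frame -- 3x3 intrinsics,
              3x3 rotation, 3-vector translation (world -> camera)
    rects   : list of (xmin, ymin, xmax, ymax) pixel rectangles,
              one per annotated frame
    """
    mask = np.ones(len(points), dtype=bool)
    for (K, R, t), (x0, y0, x1, y1) in zip(cameras, rects):
        cam = points @ R.T + t                 # world -> camera coordinates
        in_front = cam[:, 2] > 0               # discard points behind the camera
        proj = cam @ K.T                       # pinhole projection
        depth = np.where(in_front, proj[:, 2], 1.0)  # avoid divide-by-zero
        uv = proj[:, :2] / depth[:, None]
        inside = ((uv[:, 0] >= x0) & (uv[:, 0] <= x1) &
                  (uv[:, 1] >= y0) & (uv[:, 1] <= y1))
        mask &= in_front & inside              # intersection over all views
    return points[mask]

def fit_box(selected):
    """Axis-aligned 3D bounding box (min corner, max corner)."""
    return selected.min(axis=0), selected.max(axis=0)
```

Intersecting the back-projections across several views is what disambiguates the object: a single rectangle selects an entire viewing frustum, but points outside the object rarely survive the frusta of multiple distinct viewpoints.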

Supplementary Material

Video figure (iss23main-p5974-p-video.mp4)
Teaser video (iss23main-p5974-p-teaser.mp4)


Published In

Proceedings of the ACM on Human-Computer Interaction, Volume 7, Issue ISS
December 2023, 482 pages
EISSN: 2573-0142
DOI: 10.1145/3554314

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. 3D annotation
    2. datasets

    Qualifiers

    • Research-article
