
Interactive 3D Annotation of Objects in Moving Videos from Sparse Multi-view Frames

Published: 01 November 2023

Abstract

Segmenting and determining the 3D bounding boxes of objects of interest in RGB videos is an important task for a variety of applications such as augmented reality, navigation, and robotics. Supervised machine learning techniques are commonly used for this, but they need training datasets: sets of images with associated 3D bounding boxes manually defined by human annotators using a labelling tool. However, precisely placing 3D bounding boxes can be difficult using conventional 3D manipulation tools on a 2D interface. To alleviate that burden, we propose a novel technique with which 3D bounding boxes can be created by simply drawing 2D bounding rectangles on multiple frames of a video sequence showing the object from different angles. The method uses reconstructed dense 3D point clouds from the video and computes tightly fitting 3D bounding boxes of desired objects selected by back-projecting the 2D rectangles. We show concrete application scenarios of our interface, including training dataset creation and editing 3D spaces and videos. An evaluation comparing our technique with a conventional 3D annotation tool shows that our method results in higher accuracy. We also confirm that the bounding boxes created with our interface have a lower variance, likely yielding more consistent labels and datasets.
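The selection step described above — back-projecting each drawn 2D rectangle into the reconstructed point cloud, keeping only points that fall inside every rectangle, and fitting a box to what remains — can be sketched as below. This is a minimal illustration under assumptions, not the paper's implementation: the function names (`select_points`, `fit_box`), the pinhole world-to-camera convention, and the axis-aligned box (the paper computes tightly fitting boxes, which may be oriented) are all choices made here for brevity.

```python
import numpy as np

def select_points(points, cameras, rects):
    """Keep 3D points whose projection lies inside every 2D rectangle.

    points  : (N, 3) reconstructed point cloud in world coordinates
    cameras : list of (K, R, t) per annotated frame -- 3x3 intrinsics,
              3x3 rotation, 3-vector translation (world -> camera)
    rects   : list of (xmin, ymin, xmax, ymax) pixel rectangles,
              one per annotated frame
    """
    mask = np.ones(len(points), dtype=bool)
    for (K, R, t), (x0, y0, x1, y1) in zip(cameras, rects):
        cam = points @ R.T + t                 # world -> camera coordinates
        in_front = cam[:, 2] > 0               # discard points behind the camera
        proj = cam @ K.T                       # pinhole projection
        depth = np.where(in_front, proj[:, 2], 1.0)  # avoid divide-by-zero
        uv = proj[:, :2] / depth[:, None]
        inside = ((uv[:, 0] >= x0) & (uv[:, 0] <= x1) &
                  (uv[:, 1] >= y0) & (uv[:, 1] <= y1))
        mask &= in_front & inside              # intersection over all views
    return points[mask]

def fit_box(selected):
    """Axis-aligned 3D bounding box (min corner, max corner)."""
    return selected.min(axis=0), selected.max(axis=0)
```

Intersecting the back-projections across several views is what disambiguates the object: a single rectangle selects an entire viewing frustum, but points outside the object rarely survive the frusta of multiple distinct viewpoints.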

Supplementary Material

Video figure (iss23main-p5974-p-video.mp4)
Teaser video (iss23main-p5974-p-teaser.mp4)


Published In

Proceedings of the ACM on Human-Computer Interaction, Volume 7, Issue ISS
December 2023, 482 pages
EISSN: 2573-0142
DOI: 10.1145/3554314

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. 3D annotation
    2. datasets

    Qualifiers

    • Research-article
