ABSTRACT
This demonstration is invited from the ISS 2023 paper track (https://doi.org/10.1145/3626476). Segmenting and determining the 3D bounding boxes of objects of interest in RGB videos is an important task for a variety of applications such as augmented reality, navigation, and robotics. Supervised machine learning techniques are commonly used for this, but they need training datasets: sets of images with associated 3D bounding boxes manually defined by human annotators using a labelling tool. However, precisely placing 3D bounding boxes with conventional 3D manipulation tools on a 2D interface is difficult. To alleviate that burden, we propose a novel technique with which 3D bounding boxes can be created by simply drawing 2D bounding rectangles on multiple frames of a video sequence showing the object from different angles. The method reconstructs a dense 3D point cloud from the video, selects the points belonging to the desired object by back-projecting the 2D rectangles, and computes a tightly fitting 3D bounding box around them. We show concrete application scenarios of our interface, including training dataset creation and editing 3D spaces and videos. An evaluation comparing our technique with a conventional 3D annotation tool shows that our method results in higher accuracy. We also confirm that the bounding boxes created with our interface have a lower variance, likely yielding more consistent labels and datasets.
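The core idea (select the 3D points whose projections fall inside the user-drawn 2D rectangle in every annotated frame, then fit a box around the survivors) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera model with known intrinsics `K` and extrinsics `(R, t)` per frame, and fits an axis-aligned box rather than an oriented one.

```python
import numpy as np

def project_points(points, K, R, t):
    """Project Nx3 world points to pixel coordinates (pinhole model)."""
    cam = (R @ points.T + t.reshape(3, 1)).T      # world -> camera frame
    uv = (K @ cam.T).T                            # camera -> image plane
    return uv[:, :2] / uv[:, 2:3]                 # perspective divide

def select_by_rectangles(points, views):
    """Keep points whose projection lies inside the 2D rectangle in EVERY view.

    views: list of (K, R, t, rect) with rect = (u_min, v_min, u_max, v_max).
    The multi-view intersection is what carves the object out of the cloud:
    clutter hidden behind the object in one view is pruned by another view.
    """
    mask = np.ones(len(points), dtype=bool)
    for K, R, t, (u0, v0, u1, v1) in views:
        uv = project_points(points, K, R, t)
        mask &= (uv[:, 0] >= u0) & (uv[:, 0] <= u1)
        mask &= (uv[:, 1] >= v0) & (uv[:, 1] <= v1)
    return points[mask]

def fit_aabb(points):
    """Tightly fitting axis-aligned box as (min corner, max corner)."""
    return points.min(axis=0), points.max(axis=0)
```

With two views at right angles, a point that projects inside the rectangle in one frame but outside it in the other is correctly rejected, which is why sparse multi-view rectangles suffice to isolate the object.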
Index Terms
- Interactive 3D Annotation of Objects in Moving Videos from Sparse Multi-view Frames