Abstract
Manually labeling video datasets for segmentation is extremely time-consuming. We introduce ScribbleBox, an interactive framework that significantly boosts the efficiency of annotating object instances with masks in videos. We split annotation into two steps, annotating objects with tracked boxes and labeling masks inside these tracks, and introduce automation and interaction in both steps. Box tracks are annotated efficiently by approximating the trajectory with a parametric curve that has a small number of control points, which the annotator can interactively correct. Because our approach tolerates a modest amount of noise in box placements, a track typically requires only a few clicks to annotate to sufficient accuracy. Segmentation masks are corrected via scribbles that are propagated through time. We show significant gains in annotation efficiency over past work: ScribbleBox reaches 88.92% J&F on DAVIS2017 with an average of 9.14 clicks per box track and only 4 scribble-annotated frames per video of 65.3 frames on average.
B. Chen and H. Ling contributed equally to this work.
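To make the box-track idea concrete, below is a minimal sketch (not the authors' implementation) of approximating a per-frame box trajectory with a parametric curve that has only a few control points, so that correcting one control point adjusts many frames at once. The choice of a cubic Bezier curve fitted by least squares, and all function names, are assumptions for illustration only.

```python
# Sketch: approximate a tracked box trajectory (x, y, w, h per frame)
# with a low-degree Bezier curve. The annotator would edit the few
# control points instead of every frame. Illustrative only; the paper's
# actual curve parameterization and interaction loop may differ.
import numpy as np
from math import comb

def bernstein_basis(t, degree):
    """Bernstein basis matrix of shape (len(t), degree + 1)."""
    t = np.asarray(t, dtype=float)
    return np.stack(
        [comb(degree, i) * t**i * (1.0 - t)**(degree - i)
         for i in range(degree + 1)],
        axis=1,
    )

def fit_box_track(boxes, n_control=4):
    """Least-squares fit of a Bezier curve to per-frame boxes.

    boxes: (T, 4) array of (x, y, w, h) per frame.
    Returns control points of shape (n_control, 4).
    """
    boxes = np.asarray(boxes, dtype=float)
    t = np.linspace(0.0, 1.0, len(boxes))
    B = bernstein_basis(t, n_control - 1)            # (T, n_control)
    ctrl, *_ = np.linalg.lstsq(B, boxes, rcond=None)  # solve B @ ctrl ~ boxes
    return ctrl                                       # the editable handles

def decode_box_track(ctrl, n_frames):
    """Evaluate the curve back into per-frame boxes."""
    t = np.linspace(0.0, 1.0, n_frames)
    return bernstein_basis(t, len(ctrl) - 1) @ ctrl

if __name__ == "__main__":
    # A noisy 60-frame synthetic track: the fitted curve smooths the
    # jitter, illustrating tolerance to modest noise in box placements.
    s = np.linspace(0.0, 1.0, 60)
    gt = np.stack([100 + 200 * s, 80 + 50 * s**2,
                   np.full(60, 40.0), np.full(60, 60.0)], axis=1)
    noisy = gt + np.random.normal(scale=3.0, size=gt.shape)
    recon = decode_box_track(fit_box_track(noisy, n_control=4), 60)
    print("mean abs error (px):", np.abs(recon - gt).mean())
```

In this toy setup, four control points per coordinate already reconstruct the smooth track to within a few pixels; an analogous interactive correction of control points is what keeps the reported click count per box track low.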
Acknowledgments
This work was supported by NSERC. SF acknowledges the Canada CIFAR AI Chair award at the Vector Institute.
Electronic Supplementary Material
Supplementary material 1 (mp4, 11133 KB)
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, B., Ling, H., Zeng, X., Gao, J., Xu, Z., Fidler, S. (2020). ScribbleBox: Interactive Annotation Framework for Video Object Segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12358. Springer, Cham. https://doi.org/10.1007/978-3-030-58601-0_18
DOI: https://doi.org/10.1007/978-3-030-58601-0_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58600-3
Online ISBN: 978-3-030-58601-0