ScribbleBox: Interactive Annotation Framework for Video Object Segmentation

Chen, Bowen; Ling, Huan; Zeng, Xiaohui; Gao, Jun; Xu, Ziyue; Fidler, Sanja

doi:10.1007/978-3-030-58601-0_18

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12358)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Manually labeling video datasets for segmentation tasks is extremely time-consuming. We introduce ScribbleBox, an interactive framework that significantly improves the efficiency of annotating object instances with masks in videos. We split annotation into two steps: annotating objects with tracked boxes, and labeling masks inside these tracks. We introduce automation and interaction in both steps. Box tracks are annotated efficiently by approximating the trajectory with a parametric curve that has a small number of control points, which the annotator can interactively correct. Our approach tolerates a modest amount of noise in box placement, so it typically requires only a few clicks to annotate a track to sufficient accuracy. Segmentation masks are corrected via scribbles, which are propagated through time. We show significant gains in annotation efficiency over past work: ScribbleBox reaches 88.92% J&F on DAVIS2017 with an average of 9.14 clicks per box track, and with only 4 frames requiring scribble annotation in a video of 65.3 frames on average.
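
The idea of describing a box track with a few correctable control points can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not specify the curve family, so the sketch assumes a piecewise-linear curve, boxes stored as (x, y, w, h), and a simulated annotator who always corrects the frame with the largest error. The function names `interpolate_track` and `fit_control_points` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def interpolate_track(key_frames, key_boxes, num_frames):
    """Piecewise-linear box track through the given control points.

    key_frames: sorted frame indices of the control points.
    key_boxes:  array of shape (K, 4), boxes (x, y, w, h) at those frames.
    Returns an array of shape (num_frames, 4).
    """
    frames = np.arange(num_frames)
    key_boxes = np.asarray(key_boxes, dtype=float)
    # Interpolate each box coordinate independently over time.
    return np.stack(
        [np.interp(frames, key_frames, key_boxes[:, d]) for d in range(4)],
        axis=1,
    )

def fit_control_points(track, tol):
    """Greedily add control points until every coordinate is within tol.

    Each added control point stands in for one annotator correction
    (an assumption of this sketch, not a claim about the paper).
    """
    track = np.asarray(track, dtype=float)
    n = len(track)
    keys = sorted({0, n - 1})  # always keep the first and last frame
    while True:
        approx = interpolate_track(keys, track[keys], n)
        err = np.abs(approx - track).max(axis=1)  # worst error per frame
        worst = int(err.argmax())
        if err[worst] <= tol:
            return keys, approx
        keys = sorted(keys + [worst])  # correct the worst frame
```

A larger `tol` corresponds to tolerating more noise in box placement and therefore needing fewer control points. For a noise-free track whose motion changes speed once, the sketch ends with three control points: the two endpoints and the frame where the speed changes.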

B. Chen and H. Ling—Authors contributed equally.

Acknowledgments

This work was supported by NSERC. SF acknowledges the Canada CIFAR AI Chair award at the Vector Institute.

Author information

Authors and Affiliations

University of Toronto, Toronto, Canada
Bowen Chen, Huan Ling, Xiaohui Zeng, Jun Gao, Ziyue Xu & Sanja Fidler
Vector Institute, Toronto, Canada
Huan Ling, Xiaohui Zeng, Jun Gao & Sanja Fidler
NVIDIA, Santa Clara, USA
Huan Ling, Jun Gao & Sanja Fidler

Corresponding authors

Correspondence to Bowen Chen, Huan Ling, Xiaohui Zeng, Ziyue Xu or Sanja Fidler.

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 11133 KB)

Supplementary material 2 (pdf 12776 KB)

About this paper

Cite this paper

Chen, B., Ling, H., Zeng, X., Gao, J., Xu, Z., Fidler, S. (2020). ScribbleBox: Interactive Annotation Framework for Video Object Segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12358. Springer, Cham. https://doi.org/10.1007/978-3-030-58601-0_18

DOI: https://doi.org/10.1007/978-3-030-58601-0_18
Published: 28 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58600-3
Online ISBN: 978-3-030-58601-0
eBook Packages: Computer Science, Computer Science (R0)
