A Self-supervised Framework for Human Instance Segmentation

  • Conference paper
Computer Vision – ECCV 2020 Workshops (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12536)

Abstract

Existing approaches to human-centered tasks such as human instance segmentation focus on improving model architectures, leveraging weak supervision, or transferring supervision among related tasks. However, the resulting architectures are highly task-specific, and weak supervision is limited by the available priors or the number of related tasks. In this paper, we present a novel self-supervised framework for human instance segmentation. The framework comprises two modules: one iteratively performs mutual refinement between segmentation and optical flow estimation, and the other iteratively refines pose estimates by exploiting prior knowledge about the consistency of human graph structures across consecutive frames. The outputs of the framework are then used to fine-tune segmentation networks in a feedback fashion. Experimental results on the challenging OCHuman and COCOPersons datasets show that the self-supervised framework achieves state-of-the-art performance compared with existing models, without requiring additional labels. Unlabeled video data is used together with prior knowledge to significantly improve performance and reduce the reliance on annotations. Code is released at: https://github.com/AllenYLJiang/SSINS.
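The abstract does not spell out how segmentation and optical flow interact, so the following is only a minimal sketch of one plausible ingredient, not the authors' method: a mask from one frame is warped into the next frame with a dense flow field, and the predicted mask is accepted as a pseudo-label for fine-tuning only when the two agree. The function names (`warp_mask`, `iou`, `select_pseudo_label`), the backward-flow convention, and the 0.8 agreement threshold are all assumptions made for illustration.

```python
import numpy as np


def warp_mask(mask, flow):
    """Backward-warp a binary mask with a dense flow field.

    mask: (H, W) bool array for frame t.
    flow: (H, W, 2) float array. Assumed convention: flow[y, x] = (dx, dy)
          points from a pixel in frame t+1 to its source pixel in frame t.
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup; out-of-range sources are clamped to the border.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return mask[src_y, src_x]


def iou(a, b):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0


def select_pseudo_label(prev_mask, pred_mask, flow, thresh=0.8):
    """Return a pseudo-label for frame t+1, or None if the masks disagree.

    The label is the intersection of the warped and predicted masks, a
    deliberately conservative choice (hypothetical, not from the paper).
    """
    warped = warp_mask(prev_mask, flow)
    if iou(warped, pred_mask) >= thresh:
        return np.logical_and(warped, pred_mask)
    return None


# Toy example: a 3x3 square that moves one pixel to the right.
prev_mask = np.zeros((8, 8), dtype=bool)
prev_mask[2:5, 2:5] = True
flow = np.zeros((8, 8, 2))
flow[..., 0] = -1.0  # each pixel in frame t+1 came from one pixel to its left

moved = np.zeros((8, 8), dtype=bool)
moved[2:5, 3:6] = True

label_ok = select_pseudo_label(prev_mask, moved, flow)         # masks agree
label_bad = select_pseudo_label(prev_mask, prev_mask, flow)    # masks disagree
```

In a real pipeline the flow would come from an optical flow network and the mask from a segmentation network; the agreement test is simply one common way to filter unlabeled video frames before they are fed back as training targets.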

Author information

Corresponding author

Correspondence to Yalong Jiang.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Jiang, Y., Ding, W., Li, H., Yang, H., Wang, X. (2020). A Self-supervised Framework for Human Instance Segmentation. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol 12536. Springer, Cham. https://doi.org/10.1007/978-3-030-66096-3_33

  • DOI: https://doi.org/10.1007/978-3-030-66096-3_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-66095-6

  • Online ISBN: 978-3-030-66096-3

  • eBook Packages: Computer Science, Computer Science (R0)
