Crowded pose-guided multi-task learning for instance-level human parsing

  • Original Paper
  • Machine Vision and Applications

Abstract

Instance-level human parsing remains challenging because human instances resemble the background, interact in complex ways, and appear in varied poses. To assign each human-related pixel a semantic label and simultaneously associate each label with its corresponding instance, we propose a new top-down method, based on multi-task learning guided by crowded pose estimation, that learns instance-level semantic information about human parts. First, to cope with complex backgrounds, we introduce a path attention feature pyramid that learns more robust multi-scale shared semantic features by changing feature propagation to concatenation and adding channel attention at each layer. Second, we refine the learned shared features with spatial attention and RC-ASPP, and design an instance-agnostic human parsing module that learns body part segmentation and edge information. Third, we design a Mask-RCNN-based crowded pose estimation module that uses D-SPPE and hierarchical association rules to obtain pose information. Finally, we define a fusion strategy and a multi-task learning loss that fuse the different semantic and instance features, so that the final instance-level human parsing results are learned end to end. Extensive experiments on the PASCAL-Person-Part and MHPv2.0 datasets verify the effectiveness of the proposed method, which outperforms most state-of-the-art methods.
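As a rough illustration of the first step, the sketch below fuses two pyramid levels by channel concatenation and then reweights the channels with attention. It is not the authors' implementation: the abstract does not specify the layer design, so the squeeze-and-excitation style attention, nearest-neighbour upsampling, the function names (`channel_attention`, `upsample_nearest`, `fuse_levels`), the tensor shapes, and the random weights are all assumptions made for the example.

```python
import numpy as np


def channel_attention(features, w1, w2):
    """Reweight the channels of a (C, H, W) feature map.

    Assumed design: global average pooling, a two-layer bottleneck
    (ReLU, then sigmoid), and per-channel scaling.
    """
    squeezed = features.mean(axis=(1, 2))            # (C,) channel descriptor
    hidden = np.maximum(w1 @ squeezed, 0.0)          # (C_reduced,) after ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # (C,) sigmoid gates in (0, 1)
    return features * weights[:, None, None], weights


def upsample_nearest(features, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return features.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_levels(fine, coarse, w1, w2):
    """Concatenate a fine level with an upsampled coarse level, then attend."""
    factor = fine.shape[1] // coarse.shape[1]
    stacked = np.concatenate([fine, upsample_nearest(coarse, factor)], axis=0)
    return channel_attention(stacked, w1, w2)


# Hypothetical toy inputs: two pyramid levels with 4 channels each.
rng = np.random.default_rng(0)
fine = rng.standard_normal((4, 8, 8))
coarse = rng.standard_normal((4, 4, 4))
w1 = rng.standard_normal((2, 8)) * 0.1   # bottleneck: 8 channels -> 2
w2 = rng.standard_normal((8, 2)) * 0.1   # expansion: 2 -> 8 channels
fused, weights = fuse_levels(fine, coarse, w1, w2)
```

Concatenation keeps the channels of both levels separate (8 channels here instead of 4 with element-wise addition), which lets the attention gates weight fine and coarse information independently. That is one plausible reading of the motivation stated in the abstract, not a confirmed detail of the paper.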



Acknowledgements

This research was supported by the National Natural Science Foundation of China (Grants 62262036, 61962030, 61972185) and the Yunnan Provincial Foundation for Leaders of Disciplines in Science and Technology (201905C160046).

Author information

Corresponding author

Correspondence to Li Liu.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wei, Y., Liu, L., Fu, X. et al. Crowded pose-guided multi-task learning for instance-level human parsing. Machine Vision and Applications 34, 46 (2023). https://doi.org/10.1007/s00138-023-01392-4

  • DOI: https://doi.org/10.1007/s00138-023-01392-4
