Fine-Grained Multi-human Parsing

International Journal of Computer Vision

Abstract

Despite the noticeable progress in perceptual tasks like detection, instance segmentation and human parsing, computers still perform unsatisfactorily at visually understanding humans in crowded scenes, an ability required by applications such as group behavior analysis, person re-identification, e-commerce, media editing, video surveillance, autonomous driving and virtual reality. To perform well, models need to comprehensively perceive both the semantic information and the differences between instances in a multi-human image, a problem recently formalized as the multi-human parsing task. In this paper, we first present a new large-scale database, “Multi-human Parsing (MHP v2.0)”, for algorithm development and evaluation, to advance research on understanding humans in crowded scenes. MHP v2.0 contains 25,403 elaborately annotated images with 58 fine-grained semantic category labels and 16 dense pose keypoint labels, involving 2–26 persons per image captured in real-world scenes with various viewpoints, poses, occlusions, interactions and backgrounds. We further propose a novel deep Nested Adversarial Network (NAN) model for multi-human parsing. NAN consists of three Generative Adversarial Network-like sub-nets that respectively perform semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. These sub-nets form a nested structure and are carefully designed to learn jointly in an end-to-end way. NAN consistently outperforms existing state-of-the-art solutions on our MHP v2.0 benchmark and on several other datasets, including MHP v1.0, PASCAL-Person-Part and Buffy. NAN serves as a strong baseline that sheds light on generic instance-level semantic part prediction and drives future research on multi-human parsing. With the above innovations and contributions, we have organized the CVPR 2018 Workshop on Visual Understanding of Humans in Crowd Scene (VUHCS 2018) and the Fine-Grained Multi-human Parsing and Pose Estimation Challenge, which together significantly benefit the community. Code and pre-trained models are available at https://github.com/ZhaoJ9014/Multi-Human-Parsing_MHP.
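
For readers who want a concrete view of the nested three-stage design described above, below is a minimal structural sketch in TensorFlow. The layer widths, the 59-channel parsing head (assuming 58 part labels plus one background channel) and the simple concatenation-based conditioning between stages are illustrative placeholders, not the released NAN architecture, which additionally trains each sub-net adversarially; see the repository above for the official implementation.

```python
import tensorflow as tf

def conv_block(filters):
    # Placeholder feature extractor; the real sub-nets are much deeper.
    return tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")

inputs = tf.keras.Input(shape=(512, 512, 3))

# Stage 1: semantic saliency prediction (person-foreground probability map).
f1 = conv_block(64)(inputs)
saliency = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(f1)

# Stage 2: instance-agnostic parsing, conditioned on stage 1's output.
f2 = conv_block(64)(tf.keras.layers.Concatenate()([inputs, saliency]))
parsing = tf.keras.layers.Conv2D(59, 1, activation="softmax")(f2)  # 58 parts + background (assumed)

# Stage 3: instance-aware clustering, conditioned on stage 2's output;
# per-pixel embeddings are later grouped into person instances.
f3 = conv_block(64)(tf.keras.layers.Concatenate()([inputs, parsing]))
embedding = tf.keras.layers.Conv2D(8, 1)(f3)

# The nested composition lets losses on later stages back-propagate into
# earlier stages, so the three sub-nets can train jointly end-to-end.
model = tf.keras.Model(inputs, [saliency, parsing, embedding])
```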


Notes

  1. https://github.com/ZhaoJ9014/Multi-Human-Parsing_MHP.

  2. https://vuhcs.github.io/#portfolio.

  3. http://lv-mhp.github.io/.

  4. PASCAL-VOC-2012 (Everingham et al. 2015) and Microsoft COCO (Lin et al. 2014) are not included because they contain only a limited percentage of crowd-scene images with fine person details.

  5. The trainable parameters of each stage (each sub-net) are mainly learned through the losses of the corresponding stage. However, due to the nested structure, they can still be adjusted to some degree by the losses of subsequent stages during gradient back-propagation (see the gradient-flow sketch after this list).

  6. As existing instance segmentation methods only offer silhouettes of different person instances, for comparison we combine them with our instance-agnostic parsing prediction to generate the final multi-human parsing results (see the mask-combination sketch after this list).

  7. We adopt a CRF as a post-processing step to refine the instance-agnostic parsing map by associating each pixel in the image with one of the semantic categories (see the CRF sketch after this list).

  8. For each testing image, we calculate the pairwise IoU between instance bounding boxes and use the mean value as the image's interaction intensity (see the IoU sketch after this list).

  9. Since Mask R-CNN only offers silhouettes of different person instances, we do not compare multi-human parsing speed with it.

  10. The dataset is available at http://lv-mhp.github.io/.

  11. The dataset is available at http://www.stat.ucla.edu/~xianjie.chen/pascal_part_dataset/pascal_part.html.

  12. The dataset is available at https://www.inf.ethz.ch/personal/ladickyl/Buffy.zip.
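
To make Note 5 concrete, here is a toy TensorFlow sketch (the shapes and the single stage-3 loss are hypothetical, not the NAN losses) showing that a later stage's loss still produces gradients for an earlier stage's parameters through the nested forward path:

```python
import tensorflow as tf

stage1 = tf.keras.layers.Dense(8, activation="relu")
stage2 = tf.keras.layers.Dense(8, activation="relu")
stage3 = tf.keras.layers.Dense(1)

x = tf.random.normal([4, 16])
y = tf.random.normal([4, 1])

with tf.GradientTape() as tape:
    h1 = stage1(x)    # stage 1 output feeds stage 2 ...
    h2 = stage2(h1)   # ... whose output feeds stage 3
    loss3 = tf.reduce_mean((stage3(h2) - y) ** 2)  # a stage-3 loss only

# The stage-3 loss reaches stage-1 weights via the nested structure,
# which is the "adjusted to some degree" effect described in Note 5.
grads = tape.gradient(loss3, stage1.trainable_variables)
print([g is not None for g in grads])  # [True, True]
```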
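
The combination step in Note 6 can be written in a few lines of NumPy. This sketch (the function name is ours, for illustration) intersects each predicted person silhouette with the instance-agnostic part-label map:

```python
import numpy as np

def combine_with_parsing(instance_masks, part_map, background=0):
    """instance_masks: list of (H, W) boolean person silhouettes;
    part_map: (H, W) integer instance-agnostic part labels.
    Returns one per-instance part-label map per person."""
    return [np.where(mask, part_map, background) for mask in instance_masks]
```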
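
For the CRF post-processing in Note 7, a common realization is a fully connected CRF over the parsing softmax. The sketch below assumes the third-party pydensecrf package, and the kernel hyper-parameters are illustrative rather than the authors' settings:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, n_iters=10):
    """image: (H, W, 3) uint8 RGB; probs: (C, H, W) class softmax scores.
    Returns an (H, W) refined semantic label map."""
    h, w = image.shape[:2]
    d = dcrf.DenseCRF2D(w, h, probs.shape[0])
    d.setUnaryEnergy(unary_from_softmax(probs))  # -log(prob) unaries
    d.addPairwiseGaussian(sxy=3, compat=3)       # smoothness kernel
    d.addPairwiseBilateral(sxy=80, srgb=13,      # appearance kernel
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(n_iters)
    return np.argmax(np.array(q).reshape(probs.shape[0], h, w), axis=0)
```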
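
Note 8's interaction-intensity measure is simply the mean pairwise IoU of instance bounding boxes. A self-contained sketch (helper names are ours):

```python
import numpy as np
from itertools import combinations

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def interaction_intensity(boxes):
    """Mean pairwise IoU over all person boxes in one image."""
    pairs = list(combinations(boxes, 2))
    return float(np.mean([box_iou(a, b) for a, b in pairs])) if pairs else 0.0

print(interaction_intensity([(0, 0, 10, 10), (5, 5, 15, 15)]))  # ~0.1429
```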

References

  • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M. et al. (2016). TensorFlow: A system for large-scale machine learning. In OSDI (pp. 265–283).

  • Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. T-PAMI, 33(5), 898–916.

  • Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., & Yuille, A. (2014). Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR (pp. 1971–1978).

  • Chen, L.-C., Yang, Y., Wang, J., Xu, W., & Yuille, A. L. (2016). Attention to scale: Scale-aware semantic image segmentation. In CVPR (pp. 3640–3649).

  • Chu, X., Ouyang, W., Yang, W., & Wang, X. (2015). Multi-task recurrent neural network for immediacy prediction. In ICCV (pp. 3352–3360).

  • Collins, R. T., Lipton, A. J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P. et al. (2000). A system for video surveillance and monitoring. VSAM final report (pp. 1–68).

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In CVPR (pp. 3213–3223).

  • Dai, J., He, K., & Sun, J. (2016). Instance-aware semantic segmentation via multi-task network cascades. In CVPR (pp. 3150–3158).

  • De Brabandere, B., Neven, D., & Van Gool, L. (2017). Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551.

  • Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2012). Pedestrian detection: An evaluation of the state of the art. T-PAMI, 34(4), 743–761.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2011). The PASCAL visual object classes challenge 2011 (VOC2011) results. Retrieved May 25, 2011 from http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.

  • Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.

  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In CVPR (pp. 1–8).

  • Gan, C., Lin, M., Yang, Y., de Melo, G., & Hauptmann, A. G. (2016). Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition. In AAAI (p. 3487).

  • Girshick, R. (2015). Fast R-CNN. arXiv preprint arXiv:1504.08083.

  • Gong, K., Liang, X., Shen, X., & Lin, L. (2017). Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. arXiv preprint arXiv:1703.05446.

  • Hariharan, B., Arbeláez, P., Girshick, R., & Malik, J. (2014). Simultaneous detection and segmentation. In ECCV (pp. 297–312).

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In ICCV (pp. 2980–2988).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).

  • Jiang, H., & Grauman, K. (2016). Detangling people: Individuating multiple close people and their body parts via region assembly. arXiv preprint arXiv:1604.03880.

  • Klare, B. F., Klein, B., Taborsky, E., Blanton, A., Cheney, J., Allen, K., Grother, P., Mah, A., & Jain, A. K. (2015). Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In CVPR (pp. 1931–1939).

  • Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.

  • Li, Q., Arnab, A., & Torr, P. H. (2017a). Holistic, instance-level human parsing. arXiv preprint arXiv:1709.03612.

  • Li, G., Xie, Y., Lin, L., & Yu, Y. (2017b). Instance-level salient object segmentation. In CVPR (pp. 247–256).

  • Li, J., Zhao, J., Wei, Y., Lang, C., Li, Y., Sim, T., Yan, S., & Feng, J. (2017c). Multi-human parsing in the wild. arXiv preprint arXiv:1705.07206.

  • Liang, X., Wei, Y., Shen, X., Yang, J., Lin, L., & Yan, S. (2015a). Proposal-free network for instance-level object segmentation. arXiv preprint arXiv:1509.02636.

  • Liang, X., Xu, C., Shen, X., Yang, J., Liu, S., Tang, J., Lin, L., & Yan, S. (2015b). Human parsing with contextualized convolutional neural network. In ICCV (pp. 1386–1394).

  • Lin, J., Guo, X., Shao, J., Jiang, C., Zhu, Y., & Zhu, S.-C. (2016). A virtual reality platform for dynamic human-scene interaction. In SIGGRAPH (p. 11).

  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV (pp. 740–755).

  • Liu, S., Wang, C., Qian, R., Yu, H., Bao, R., & Sun, Y. (2017). Surveillance video parsing with single frame supervision. In CVPRW (pp. 1–9).

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR (pp. 3431–3440).

  • Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In NIPS (pp. 849–856).

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS (pp. 91–99).

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.

  • Sapp, B., & Taskar, B. (2013). Modec: Multimodal decomposable models for human pose estimation. In CVPR (pp. 3674–3681).

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  • Turban, E., King, D., Lee, J., & Viehland, D. (2002). Electronic commerce: A managerial perspective. Englewood Cliffs: Prentice Hall.

  • Vineet, V., Warrell, J., Ladicky, L., & Torr, P. H. (2011). Human instance segmentation from video using detector-based conditional random fields. In BMVC (Vol. 2, pp. 12–15).

  • Wu, Z., Shen, C., & Van Den Hengel, A. (2016). Wider or deeper: Revisiting the ResNet model for visual recognition. arXiv preprint arXiv:1611.10080.

  • Xia, F., Wang, P., Chen, L.-C., & Yuille, A. L. (2016). Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV (pp. 648–663).

  • Xu, N., Price, B., Cohen, S., Yang, J., & Huang, T. S. (2016). Deep interactive object selection. In CVPR (pp. 373–381).

  • Yamaguchi, K., Kiapour, M. H., Ortiz, L. E., & Berg, T. L. (2012). Parsing clothing in fashion photographs. In CVPR (pp. 3570–3577).

  • Zhang, Z., Luo, P., Loy, C. C., & Tang, X. (2018). From facial expression recognition to interpersonal relation prediction. International Journal of Computer Vision, 126(5), 550–569.

  • Zhang, N., Paluri, M., Taigman, Y., Fergus, R., & Bourdev, L. (2015). Beyond frontal faces: Improving person recognition using multiple cues. In CVPR (pp. 4804–4813).

  • Zhao, J., Li, J., Cheng, Y., Sim, T., Yan, S., & Feng, J. (2018). Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing. In ACM MM (pp. 792–800).

  • Zhao, J., Li, J., Nie, X., Zhao, F., Chen, Y., Wang, Z., Feng, J., & Yan, S. (2017). Self-supervised neural aggregation networks for human parsing. In CVPRW (pp. 7–15).

  • Zhao, R., Ouyang, W., & Wang, X. (2013). Unsupervised salience learning for person re-identification. In CVPR (pp. 3586–3593).

Acknowledgements

The work of Jian Zhao was partially supported by China Scholarship Council (CSC) Grant 201503170248. The work of Jiashi Feng was partially supported by NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-D17-112.

Author information

Corresponding author

Correspondence to Jian Zhao.

Additional information

Communicated by Li Liu, Matti Pietikäinen, Jie Qin, Jie Chen, Wanli Ouyang, Luc Van Gool.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Zhao, J., Li, J., Liu, H. et al. Fine-Grained Multi-human Parsing. Int J Comput Vis 128, 2185–2203 (2020). https://doi.org/10.1007/s11263-019-01181-5
