Neurocomputing

Volume 402, 18 August 2020, Pages 375-383

Semantic-spatial fusion network for human parsing

https://doi.org/10.1016/j.neucom.2020.03.096

Abstract

Recently, many methods have combined low-level and high-level features to generate accurate high-resolution predictions for human parsing. Nevertheless, in some methods there exists a semantic-spatial gap between low-level and high-level features, i.e., high-level features carry more semantics but fewer spatial details, while low-level features carry less semantics but more spatial details. In this paper, we propose a Semantic-Spatial Fusion Network (SSFNet) for human parsing that shrinks this gap and generates an accurate high-resolution prediction by aggregating multi-resolution features. SSFNet includes two models, a semantic modulation model and a resolution-aware model. The semantic modulation model guides spatial details with semantics and thereby facilitates feature fusion, narrowing the gap. The resolution-aware model boosts the feature fusion and captures multiple receptive fields, generating reliable and fine-grained high-resolution features for each branch in bottom-up and top-down processes. Extensive experiments on three public datasets, PASCAL-Person-Part, LIP and PPSS, show that SSFNet achieves significant improvements over state-of-the-art methods.

Introduction

Human parsing [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11] is a fine-grained semantic segmentation task. It aims to predict a pixel-wise mask of body parts or clothing items in human images. Understanding the details of human images is valuable in many applications, for example, person re-identification [12], human behavior analysis [13], clothing style recognition and retrieval [14], and clothing category classification [15], to name a few. However, the repeated down-sampling caused by pooling and strided convolutions makes the prediction lose details relative to the input image. There are two main streams of networks that fuse low-level and high-level features to obtain a high-resolution prediction. The first type of methods [10], [11], [16] employs the "U-net" [17] structure, which fuses high-level and low-level features through skip connections. The second type [18], [19] fuses features through residual connections. The drawback of both is the semantic-spatial gap between features of the two different levels [20].
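
To make the two fusion styles concrete, the following PyTorch sketch contrasts concatenation-based skip fusion with residual-addition fusion. This is our illustration, not code from any of the cited papers; the module names `SkipFusion` and `ResidualFusion` are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusion(nn.Module):
    """U-net-style fusion: upsample high-level features and concatenate."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(low_ch + high_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, low, high):
        # Bring coarse features up to the fine (low-level) resolution.
        high = F.interpolate(high, size=low.shape[-2:], mode='bilinear',
                             align_corners=False)
        return self.conv(torch.cat([low, high], dim=1))

class ResidualFusion(nn.Module):
    """Residual-style fusion: project both levels to one width and add."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.low_proj = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.high_proj = nn.Conv2d(high_ch, out_ch, kernel_size=1)

    def forward(self, low, high):
        high = F.interpolate(high, size=low.shape[-2:], mode='bilinear',
                             align_corners=False)
        return self.low_proj(low) + self.high_proj(high)
```

Both patterns mix the two levels directly, which is exactly where the semantic-spatial gap discussed next arises.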

The semantic-spatial gap in feature fusion means that deep features represent more semantics and fewer spatial details than shallow features, and vice versa. Consider the extreme case in which low-level features can only distinguish shallow concepts such as points, lines or edges. Intuitively, it is difficult to fuse high-level features with such low-level ones, because the low-level features are too noisy to provide high-resolution semantic guidance. Similarly, high-level features contain few spatial details, so low-level features cannot take full advantage of high-level semantics. As shown in Fig. 1, the predictions generated by MMAN [10] in the second column illustrate this semantic-spatial gap: some parts lack spatial details, while others retain spatial details but carry wrong semantic labels.

In this paper, we propose a Semantic-Spatial Fusion Network (SSFNet) for human parsing that shrinks this gap and generates an accurate high-resolution prediction. SSFNet mainly includes two models, a semantic modulation model and a resolution-aware model. Compared with generic semantic segmentation, the gaps here are fine-grained, because human parsing segments human bodies into small parts rather than treating the whole body as a single region. Thus, SSFNet gradually shrinks the fine-grained gap by deploying the two models in different branches to obtain a coarse-to-fine prediction. In particular, SSFNet fuses multi-resolution features to obtain the desired high-resolution prediction.

The semantic modulation model effectively facilitates the fusion between low-level and high-level features, which shrinks the semantic-spatial gap, as shown in Fig. 2(b). Specifically, the model takes features of two different levels as inputs and generates features with more semantics and spatial details, in a dual-branch structure. As shown in Fig. 2(d), in each branch a convolutional layer is applied to the high-level features to generate a modulation tensor, which guides low-level spatial details with high-level semantics so that spatial details are associated with the correct semantic labels. For example, the spatial details at the head position (i.e., edges, corners) receive head labels. High-level features then have a chance to fuse with semantically meaningful spatial details. Hence, the model alleviates the semantic-spatial gap between low-level and high-level features through effective feature fusion. Moreover, this model not only up-samples high-level features but also down-samples low-level features, which is more robust and accurate than methods [10], [11], [16] that only up-sample high-level features without regard to low-level features.
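
A minimal sketch of how such a modulation step could look is given below. It is our reading of Fig. 2(d), not the authors' released code: the sigmoid gating, the pooling choice for down-sampling, and all module names are our assumptions. A convolution on the high-level features produces a modulation tensor that re-weights the low-level spatial details, in both a fine-resolution and a coarse-resolution branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticModulation(nn.Module):
    """Sketch of semantic modulation (our assumption, not the paper's code)."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        # Modulation tensors predicted from high-level semantics, one per branch.
        self.mod_hi = nn.Conv2d(high_ch, low_ch, kernel_size=3, padding=1)
        self.mod_lo = nn.Conv2d(high_ch, low_ch, kernel_size=3, padding=1)
        self.fuse_hi = nn.Conv2d(low_ch + high_ch, out_ch, kernel_size=3, padding=1)
        self.fuse_lo = nn.Conv2d(low_ch + high_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, low, high):
        # Fine branch: upsample high-level features to the low-level
        # (finer) resolution, gate the spatial details, then fuse.
        high_up = F.interpolate(high, size=low.shape[-2:], mode='bilinear',
                                align_corners=False)
        gated_hi = low * torch.sigmoid(self.mod_hi(high_up))
        out_hi = self.fuse_hi(torch.cat([gated_hi, high_up], dim=1))

        # Coarse branch: downsample low-level features to the high-level
        # (coarser) resolution before modulation and fusion.
        low_dn = F.adaptive_avg_pool2d(low, output_size=high.shape[-2:])
        gated_lo = low_dn * torch.sigmoid(self.mod_lo(high))
        out_lo = self.fuse_lo(torch.cat([gated_lo, high], dim=1))
        return out_hi, out_lo  # semantic-spatial features at two resolutions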

To obtain more reliable and fine-grained high-resolution features, we present the resolution-aware model, as shown in Fig. 2(c). This model further boosts the feature fusion and shrinks the gap, achieving multi-scale and multi-receptive-field fusion to parse human parts. Unlike the ordinary hourglass network [22], which focuses on a single input, our resolution-aware model takes two inputs, so that the second input remedies the details lost along a series of convolutional operations. Thus, the model extracts deep semantics while keeping shallow details, and generates reliable and fine-grained features at different resolutions in different branches, in bottom-up and top-down processes.
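
The sketch below illustrates the two-input hourglass idea under our assumptions (a single bottom-up/top-down stage for brevity; the structure and names are ours, not the paper's). The bottom-up path enlarges the receptive field, and the second input re-injects shallow details on the top-down path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResolutionAware(nn.Module):
    """Two-input hourglass sketch (our assumption of the structure)."""
    def __init__(self, ch):
        super().__init__()
        self.down = nn.Conv2d(ch, ch, kernel_size=3, stride=2, padding=1)  # bottom-up
        self.mid = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.up_conv = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)     # top-down

    def forward(self, feat, detail):
        # Bottom-up: shrink resolution to gather deeper semantics and a
        # larger receptive field.
        x = F.relu(self.mid(F.relu(self.down(feat))))
        # Top-down: restore resolution, then remedy missing details with the
        # second input `detail` (a skip from the shallow stream, ch channels).
        x = F.interpolate(x, size=feat.shape[-2:], mode='bilinear',
                          align_corners=False)
        detail = F.interpolate(detail, size=feat.shape[-2:], mode='bilinear',
                               align_corners=False)
        return self.up_conv(torch.cat([x, detail], dim=1))
```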

Extensive experiments show that SSFNet consistently achieves new state-of-the-art results on three public benchmarks: PASCAL-Person-Part [21], LIP [8] and PPSS [23]. Moreover, a model trained on LIP generalizes well to the relatively small PPSS dataset. Specifically, SSFNet outperforms the competing methods by 1.42%, 1.43%, and 5.16% mIoU on PASCAL-Person-Part, LIP, and PPSS, respectively.

In summary, our contributions are threefold:

  1. We present a Semantic-Spatial Fusion Network (SSFNet) which shrinks the semantic-spatial gap and achieves new state-of-the-art results on three benchmark datasets.

  2. We propose a semantic modulation model which guides spatial details with semantics and thereby facilitates feature fusion, narrowing the semantic-spatial gap between low-level and high-level features.

  3. We develop a resolution-aware model which achieves multi-scale and multi-receptive-field fusion to generate reliable and fine-grained high-resolution features, in bottom-up and top-down processes.

Section snippets

Human parsing

Many research efforts have been devoted to human parsing [4], [5], [10], [11], [16], [19]. Gong et al. [4] presented PGN to fuse semantic features and edge features, which generated predictions with accurate boundaries. Li et al. [5] proposed a network that fused detection features with semantic features for human parsing. Nie et al. [19] introduced the MuLA network to jointly learn features for human parsing and pose estimation. Liu et al. [11] fused multi-scale features to leverage the useful information at different scales.

Proposed network

In this section, we elaborate on the proposed SSFNet, including its overall structure and individual components, as shown in Fig. 2. We first introduce the whole network, then the semantic modulation model, and finally the resolution-aware model.

Experiments

To evaluate the performance of the proposed SSFNet, we perform experiments on three datasets: PASCAL-Person-Part [21], LIP [8] and PPSS [23]. The first contains both single-person and multiple-person images; the last two are single-person datasets. The accuracy of each part (or clothing item) is measured by pixel Intersection-over-Union (IoU), and the mean pixel Intersection-over-Union (mIoU) is computed by averaging the IoU across all parts. We use both IoU and mIoU as evaluation metrics.
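
For reference, per-part IoU and mIoU can be computed as below. These are the standard definitions, not code from the paper; `pred` and `gt` are hypothetical integer label maps of equal shape.

```python
import numpy as np

def per_part_iou(pred, gt, num_parts):
    """Per-part pixel IoU and mIoU for integer label maps of equal shape."""
    ious = []
    for c in range(num_parts):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        # A part absent from both prediction and ground truth is skipped.
        ious.append(inter / union if union > 0 else float('nan'))
    return ious, float(np.nanmean(ious))
```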

Conclusion

In this paper, we propose a novel CNN architecture for human parsing, the Semantic-Spatial Fusion Network (SSFNet), to alleviate the semantic-spatial gap and generate precise predictions. SSFNet includes two models, a semantic modulation model and a resolution-aware model. The semantic modulation model narrows the semantic-spatial gap between high-level and low-level features by exploring their mutual information and outputs semantic-spatial features of two resolutions, where these maps learn

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Xiaomei Zhang: Conceptualization, Methodology, Software, Validation, Writing - original draft. Yingying Chen: Writing - review & editing. Bingke Zhu: Data curation, Visualization, Writing - review & editing. Jinqiao Wang: Writing - review & editing. Ming Tang: Writing - review & editing.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants 61976210 and 61772527.


References (45)

  • F. Cheng et al., Leveraging semantic segmentation with learning-based confidence measure, Neurocomputing, 2019.
  • F. Shen et al., Semantic image segmentation via guidance of image classification, Neurocomputing, 2019.
  • P. Wang et al., Joint object and part segmentation using deep learned potentials, IEEE International Conference on Computer Vision, 2015.
  • F. Xia et al., Zoom better to see clearer: human and object parsing with hierarchical auto-zoom net, IEEE International Conference on Computer Vision, 2015.
  • F. Xia et al., Joint multi-person pose estimation and semantic part segmentation, IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
  • K. Gong et al., Instance-level human parsing via part grouping network, European Conference on Computer Vision, 2018.
  • Q. Li, A. Arnab, P.H. Torr, Holistic, instance-level human parsing, British Machine Vision Conference, ...
  • X. Liang et al., Interpretable structure-evolving LSTM, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • X. Liang et al., Semantic object parsing with local-global long short-term memory, IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • K. Gong et al., Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • B. Zhu et al., Progressive cognitive human parsing, 2018.
  • Y. Luo et al., Macro-micro adversarial network for human parsing, European Conference on Computer Vision, 2018.
  • T. Liu et al., Devil in the details: towards accurate single and multiple human parsing, AAAI Conference on Artificial Intelligence, 2019.
  • M. Farenzena et al., Person re-identification by symmetry-driven accumulation of local features, IEEE Conference on Computer Vision and Pattern Recognition, 2010.
  • Y. Wang et al., Discriminative hierarchical part-based models for human parsing and action recognition, J. Mach. Learn. Res., 2012.
  • K. Yamaguchi et al., Paper doll parsing: retrieving similar styles to parse clothing items, IEEE International Conference on Computer Vision, 2014.
  • W. Wang et al., Attentive fashion grammar network for fashion landmark detection and clothing category classification, IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • X. Liang et al., Human parsing with contextualized convolutional neural network, IEEE Trans. Pattern Anal. Mach. Intell., 2016.
  • O. Ronneberger et al., U-net: convolutional networks for biomedical image segmentation, International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
  • G. Lin et al., RefineNet: multi-path refinement networks for high-resolution semantic segmentation, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • X. Nie et al., Mutual learning to adapt for joint human parsing and pose estimation, European Conference on Computer Vision, 2018.
  • Z. Zhang et al., ExFuse: enhancing feature fusion for semantic segmentation, European Conference on Computer Vision, 2018.

Xiaomei Zhang received her B.S. degree from North China Electric Power University, China, in 2016. She is currently working toward the Ph.D. degree in pattern recognition and intelligence systems at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences. Her current research interests include pattern recognition and machine learning, image processing, and semantic segmentation.

Yingying Chen received her B.S. degree in 2013 from Communication University of China, and Ph.D. degree in 2018 from University of Chinese Academy of Sciences. She is currently an assistant professor in pattern recognition and intelligence systems at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences. Her current research interests include pattern recognition and machine learning, image and video processing, and intelligent video surveillance.

Bingke Zhu received his B.S. degree from Beijing University of Chemical Technology, China, in 2016. He is currently working toward the Ph.D. degree in pattern recognition and intelligence systems at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences. His current research interests include pattern recognition and machine learning, alpha matting, semantic segmentation, and instance segmentation.

Jinqiao Wang received the B.E. degree in 2001 from Hebei University of Technology, China, and the M.S. degree in 2004 from Tianjin University, China. He received the Ph.D. degree in pattern recognition and intelligence systems from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, in 2008. He is currently a Professor with the Chinese Academy of Sciences. His research interests include pattern recognition and machine learning, image and video processing, mobile multimedia, and intelligent video surveillance.

Ming Tang received the B.S. degree in computer science and engineering and M.S. degree in artificial intelligence from Zhejiang University, Hangzhou, China, in 1984 and 1987, respectively, and the Ph.D. degree in pattern recognition and intelligent system from the Chinese Academy of Sciences, Beijing, China, in 2002. He is currently a Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include computer vision and machine learning.
