Neurocomputing

Volume 402, 18 August 2020, Pages 375-383

Semantic-spatial fusion network for human parsing

https://doi.org/10.1016/j.neucom.2020.03.096

Abstract

Recently, many methods have combined low-level and high-level features to generate accurate high-resolution predictions for human parsing. Nevertheless, in some methods there exists a semantic-spatial gap between low-level and high-level features, i.e., high-level features carry more semantics but fewer spatial details, while low-level features carry less semantics but more spatial details. In this paper, we propose a Semantic-Spatial Fusion Network (SSFNet) for human parsing that shrinks this gap and generates an accurate high-resolution prediction by aggregating multi-resolution features. SSFNet includes two models, a semantic modulation model and a resolution-aware model. The semantic modulation model guides spatial details with semantics and thereby facilitates feature fusion, narrowing the gap. The resolution-aware model boosts the feature fusion and captures multiple receptive fields, generating reliable and fine-grained high-resolution features for each branch in bottom-up and top-down processes. Extensive experiments on three public datasets, PASCAL-Person-Part, LIP and PPSS, show that SSFNet achieves significant improvements over state-of-the-art methods.

Introduction

Human parsing [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11] is a fine-grained semantic segmentation task. It aims to predict a pixel-wise mask of body parts or clothing items in human images. Understanding the details of human images is valuable in many applications, for example, person re-identification [12], human behavior analysis [13], clothing style recognition and retrieval [14], and clothing category classification [15], to name a few. However, the repeated down-sampling caused by pooling and strided convolutions makes the prediction lose details relative to the input image. There are two main streams of networks that fuse low-level and high-level features to obtain a high-resolution prediction. The first type of methods [10], [11], [16] employs the "U-net" [17] structure, which fuses high-level and low-level features through skip connections. The second type [18], [19] fuses features through residual connections. The drawback of both is the semantic-spatial gap between features of the two different levels [20].
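
To make the two fusion styles concrete, the following PyTorch sketch contrasts concatenation-based skip fusion with residual-addition fusion. This is our illustration, not code from any of the cited papers; the module names `SkipFusion` and `ResidualFusion` are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusion(nn.Module):
    """U-net-style fusion: upsample high-level features and concatenate."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(low_ch + high_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, low, high):
        # Bring coarse features up to the fine (low-level) resolution.
        high = F.interpolate(high, size=low.shape[-2:], mode='bilinear',
                             align_corners=False)
        return self.conv(torch.cat([low, high], dim=1))

class ResidualFusion(nn.Module):
    """Residual-style fusion: project both levels to one width and add."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.low_proj = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.high_proj = nn.Conv2d(high_ch, out_ch, kernel_size=1)

    def forward(self, low, high):
        high = F.interpolate(high, size=low.shape[-2:], mode='bilinear',
                             align_corners=False)
        return self.low_proj(low) + self.high_proj(high)
```

Both patterns mix the two levels directly, which is exactly where the semantic-spatial gap discussed next arises.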

The semantic-spatial gap in feature fusion means that deep features represent more semantics and fewer spatial details than shallow features, and vice versa. Consider the extreme case in which low-level features can only distinguish shallow concepts such as points, lines or edges. Intuitively, it is difficult to fuse high-level features with such low-level ones, because the low-level features are too noisy to provide high-resolution semantic guidance. Similarly, high-level features contain few spatial details, so low-level features cannot take full advantage of high-level semantics. As shown in Fig. 1, the predictions generated by MMAN [10] in the second column illustrate this semantic-spatial gap: some parts lack spatial details, while others retain spatial details but carry wrong semantic labels.

In this paper, we propose a Semantic-Spatial Fusion Network (SSFNet) for human parsing that shrinks this gap and generates an accurate high-resolution prediction. SSFNet mainly includes two models, a semantic modulation model and a resolution-aware model. Compared with generic semantic segmentation, the gaps here are fine-grained, because human parsing segments human bodies into small parts rather than treating the whole body as a single region. Thus, SSFNet gradually shrinks the fine-grained gap by deploying the two models in different branches to obtain a coarse-to-fine prediction. In particular, SSFNet fuses multi-resolution features to obtain the desired high-resolution prediction.

The semantic modulation model effectively facilitates the fusion between low-level and high-level features, which shrinks the semantic-spatial gap, as shown in Fig. 2(b). Specifically, the model takes features of two different levels as inputs and generates features with more semantics and spatial details, in a dual-branch structure. As shown in Fig. 2(d), in each branch a convolutional layer is applied to the high-level features to generate a modulation tensor, which guides low-level spatial details with high-level semantics so that spatial details are associated with the correct semantic labels. For example, the spatial details at the head position (i.e., edges, corners) receive head labels. High-level features then have a chance to fuse with semantically meaningful spatial details. Hence, the model alleviates the semantic-spatial gap between low-level and high-level features through effective feature fusion. Moreover, this model not only up-samples high-level features but also down-samples low-level features, which is more robust and accurate than methods [10], [11], [16] that only up-sample high-level features without regard to low-level features.
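
A minimal sketch of how such a modulation step could look is given below. It is our reading of Fig. 2(d), not the authors' released code: the sigmoid gating, the pooling choice for down-sampling, and all module names are our assumptions. A convolution on the high-level features produces a modulation tensor that re-weights the low-level spatial details, in both a fine-resolution and a coarse-resolution branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticModulation(nn.Module):
    """Sketch of semantic modulation (our assumption, not the paper's code)."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        # Modulation tensors predicted from high-level semantics, one per branch.
        self.mod_hi = nn.Conv2d(high_ch, low_ch, kernel_size=3, padding=1)
        self.mod_lo = nn.Conv2d(high_ch, low_ch, kernel_size=3, padding=1)
        self.fuse_hi = nn.Conv2d(low_ch + high_ch, out_ch, kernel_size=3, padding=1)
        self.fuse_lo = nn.Conv2d(low_ch + high_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, low, high):
        # Fine branch: upsample high-level features to the low-level
        # (finer) resolution, gate the spatial details, then fuse.
        high_up = F.interpolate(high, size=low.shape[-2:], mode='bilinear',
                                align_corners=False)
        gated_hi = low * torch.sigmoid(self.mod_hi(high_up))
        out_hi = self.fuse_hi(torch.cat([gated_hi, high_up], dim=1))

        # Coarse branch: downsample low-level features to the high-level
        # (coarser) resolution before modulation and fusion.
        low_dn = F.adaptive_avg_pool2d(low, output_size=high.shape[-2:])
        gated_lo = low_dn * torch.sigmoid(self.mod_lo(high))
        out_lo = self.fuse_lo(torch.cat([gated_lo, high], dim=1))
        return out_hi, out_lo  # semantic-spatial features at two resolutions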

To obtain more reliable and fine-grained high-resolution features, we present the resolution-aware model, as shown in Fig. 2(c). This model further boosts the feature fusion and shrinks the gap, achieving multi-scale and multi-receptive-field fusion to parse human parts. Unlike the ordinary hourglass network [22], which focuses on a single input, our resolution-aware model takes two inputs, so that the second input remedies the details lost along a series of convolutional operations. Thus, the model extracts deep semantics while keeping shallow details, and generates reliable and fine-grained features at different resolutions in different branches, in bottom-up and top-down processes.
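
The sketch below illustrates the two-input hourglass idea under our assumptions (a single bottom-up/top-down stage for brevity; the structure and names are ours, not the paper's). The bottom-up path enlarges the receptive field, and the second input re-injects shallow details on the top-down path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResolutionAware(nn.Module):
    """Two-input hourglass sketch (our assumption of the structure)."""
    def __init__(self, ch):
        super().__init__()
        self.down = nn.Conv2d(ch, ch, kernel_size=3, stride=2, padding=1)  # bottom-up
        self.mid = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.up_conv = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)     # top-down

    def forward(self, feat, detail):
        # Bottom-up: shrink resolution to gather deeper semantics and a
        # larger receptive field.
        x = F.relu(self.mid(F.relu(self.down(feat))))
        # Top-down: restore resolution, then remedy missing details with the
        # second input `detail` (a skip from the shallow stream, ch channels).
        x = F.interpolate(x, size=feat.shape[-2:], mode='bilinear',
                          align_corners=False)
        detail = F.interpolate(detail, size=feat.shape[-2:], mode='bilinear',
                               align_corners=False)
        return self.up_conv(torch.cat([x, detail], dim=1))
```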

Extensive experiments show that SSFNet consistently achieves new state-of-the-art results on three public benchmarks: PASCAL-Person-Part [21], LIP [8] and PPSS [23]. Moreover, a model trained on LIP generalizes well to the relatively small PPSS dataset. Specifically, SSFNet outperforms the competing methods by 1.42%, 1.43%, and 5.16% mIoU on PASCAL-Person-Part, LIP, and PPSS, respectively.

In summary, our contributions are threefold:

  1. We present a Semantic-Spatial Fusion Network (SSFNet) which shrinks the semantic-spatial gap and achieves new state-of-the-art results on three benchmark datasets.

  2. We propose a semantic modulation model which guides spatial details with semantics and thereby facilitates feature fusion, narrowing the semantic-spatial gap between low-level and high-level features.

  3. We develop a resolution-aware model which achieves multi-scale and multi-receptive-field fusion to generate reliable and fine-grained high-resolution features, in bottom-up and top-down processes.

Section snippets

Human parsing

Many research efforts have been devoted to human parsing [4], [5], [10], [11], [16], [19]. Gong et al. [4] presented PGN to fuse semantic features and edge features, which generated predictions with accurate boundaries. Li et al. [5] proposed a network that fused detection features with semantic features for human parsing. Nie et al. [19] introduced the MuLA network to jointly learn features for human parsing and pose estimation. Liu et al. [11] fused multi-scale features to leverage the useful information at different scales.

Proposed network

In this section, we elaborate on the proposed SSFNet, including its overall structure and individual components, as shown in Fig. 2. We first introduce the whole network, then the semantic modulation model, and finally the resolution-aware model.

Experiments

To evaluate the performance of the proposed SSFNet, we perform experiments on three datasets: PASCAL-Person-Part [21], LIP [8] and PPSS [23]. The first contains both single-person and multiple-person images; the last two are single-person datasets. The accuracy of each part (or clothing item) is measured by pixel Intersection-over-Union (IoU), and the mean pixel Intersection-over-Union (mIoU) is computed by averaging the IoU across all parts. We use both IoU and mIoU as evaluation metrics.
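
For reference, per-part IoU and mIoU can be computed as below. These are the standard definitions, not code from the paper; `pred` and `gt` are hypothetical integer label maps of equal shape.

```python
import numpy as np

def per_part_iou(pred, gt, num_parts):
    """Per-part pixel IoU and mIoU for integer label maps of equal shape."""
    ious = []
    for c in range(num_parts):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        # A part absent from both prediction and ground truth is skipped.
        ious.append(inter / union if union > 0 else float('nan'))
    return ious, float(np.nanmean(ious))
```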

Conclusion

In this paper, we propose a novel CNN architecture for human parsing, the Semantic-Spatial Fusion Network (SSFNet), to alleviate the semantic-spatial gap and generate precise predictions. SSFNet includes two models, a semantic modulation model and a resolution-aware model. The semantic modulation model narrows the semantic-spatial gap between high-level and low-level features by exploring their mutual information and outputs semantic-spatial features of two resolutions, where these maps learn

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Xiaomei Zhang: Conceptualization, Methodology, Software, Validation, Writing - original draft. Yingying Chen: Writing - review & editing. Bingke Zhu: Data curation, Visualization, Writing - review & editing. Jinqiao Wang: Writing - review & editing. Ming Tang: Writing - review & editing.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants 61976210 and 61772527.


References (45)

  • F. Cheng et al., Leveraging semantic segmentation with learning-based confidence measure, Neurocomputing, 2019.
  • F. Shen et al., Semantic image segmentation via guidance of image classification, Neurocomputing, 2019.
  • P. Wang et al., Joint object and part segmentation using deep learned potentials, IEEE International Conference on Computer Vision, 2015.
  • F. Xia et al., Zoom better to see clearer: human and object parsing with hierarchical auto-zoom net, IEEE International Conference on Computer Vision, 2015.
  • F. Xia et al., Joint multi-person pose estimation and semantic part segmentation, IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
  • K. Gong et al., Instance-level human parsing via part grouping network, European Conference on Computer Vision, 2018.
  • Q. Li, A. Arnab, P.H. Torr, Holistic, instance-level human parsing, British Machine Vision Conference, ...
  • X. Liang et al., Interpretable structure-evolving LSTM, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • X. Liang et al., Semantic object parsing with local-global long short-term memory, IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • K. Gong et al., Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • B. Zhu et al., Progressive cognitive human parsing, 2018.
  • Y. Luo et al., Macro-micro adversarial network for human parsing, European Conference on Computer Vision, 2018.
  • T. Liu et al., Devil in the details: towards accurate single and multiple human parsing, AAAI Conference on Artificial Intelligence, 2019.
  • M. Farenzena et al., Person re-identification by symmetry-driven accumulation of local features, IEEE Conference on Computer Vision and Pattern Recognition, 2010.
  • Y. Wang et al., Discriminative hierarchical part-based models for human parsing and action recognition, J. Mach. Learn. Res., 2012.
  • K. Yamaguchi et al., Paper doll parsing: retrieving similar styles to parse clothing items, IEEE International Conference on Computer Vision, 2014.
  • W. Wang et al., Attentive fashion grammar network for fashion landmark detection and clothing category classification, IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • X. Liang et al., Human parsing with contextualized convolutional neural network, IEEE Trans. Pattern Anal. Mach. Intell., 2016.
  • O. Ronneberger et al., U-net: convolutional networks for biomedical image segmentation, International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
  • G. Lin et al., RefineNet: multi-path refinement networks for high-resolution semantic segmentation, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • X. Nie et al., Mutual learning to adapt for joint human parsing and pose estimation, European Conference on Computer Vision, 2018.
  • Z. Zhang et al., ExFuse: enhancing feature fusion for semantic segmentation, European Conference on Computer Vision, 2018.

Xiaomei Zhang received her B.S. degree from North China Electric Power University, China, in 2016. She is currently working toward the Ph.D. degree in pattern recognition and intelligence systems at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences. Her current research interests include pattern recognition and machine learning, image processing, and semantic segmentation.

Yingying Chen received her B.S. degree in 2013 from Communication University of China, and Ph.D. degree in 2018 from University of Chinese Academy of Sciences. She is currently an assistant professor in pattern recognition and intelligence systems at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences. Her current research interests include pattern recognition and machine learning, image and video processing, and intelligent video surveillance.

Bingke Zhu received his B.S. degree from Beijing University of Chemical Technology, China, in 2016. He is currently working toward the Ph.D. degree in pattern recognition and intelligence systems at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences. His current research interests include pattern recognition and machine learning, alpha matting, semantic segmentation, and instance segmentation.

Jinqiao Wang received the B.E. degree in 2001 from Hebei University of Technology, China, and the M.S. degree in 2004 from Tianjin University, China. He received the Ph.D. degree in pattern recognition and intelligence systems from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, in 2008. He is currently a Professor with the Chinese Academy of Sciences. His research interests include pattern recognition and machine learning, image and video processing, mobile multimedia, and intelligent video surveillance.

Ming Tang received the B.S. degree in computer science and engineering and M.S. degree in artificial intelligence from Zhejiang University, Hangzhou, China, in 1984 and 1987, respectively, and the Ph.D. degree in pattern recognition and intelligent system from the Chinese Academy of Sciences, Beijing, China, in 2002. He is currently a Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include computer vision and machine learning.
