Pattern Recognition Letters

Volume 149, September 2021, Pages 9-16

Pedestrian instance segmentation with prior structure of semantic parts

https://doi.org/10.1016/j.patrec.2021.05.012

Highlights

  • We propose a simple, easily interpreted semantic part branch for pedestrian instance segmentation.

  • We propose a method for generating pedestrian part annotations.

  • Our method improves the accuracy of pedestrian instance segmentation over the baseline.

  • Our method does not require any additional annotations.

Abstract

Existing pedestrian segmentation and detection methods often show a significant drop in performance under heavy occlusion and deformation because most approaches rely on holistic modeling. Unlike many previous deep models that directly learn a holistic detector, in this paper we introduce a pedestrian instance segmentation method with a prior structure of semantic parts, named Part Mask R-CNN. Based on the proportional structure of pedestrian parts, we process the original dataset annotations to generate part annotations as a prior. By combining the semantic part branch with the classic detection and segmentation branches, the network learns more about pedestrian instances. Moreover, we obtain this more accurate pedestrian instance segmentation model without any additional manual annotations. Extensive evaluations on the Cityscapes dataset demonstrate that the proposed method improves approaches such as Mask R-CNN in accuracy on single-class pedestrian instance segmentation.

Introduction

Pedestrian detection and instance segmentation are very active research fields and have attracted considerable attention in the computer vision community. Instance segmentation can accurately detect the location of each object and provide a pixel-level segmentation mask. Research on pedestrian instance segmentation is valuable since it is an essential step towards many real-world applications, including intelligent surveillance, autonomous driving, and pedestrian retrieval.

Existing instance segmentation methods can be roughly divided into two categories according to their architectures. One is the proposal-based methods, which use a two-stage object detection network to generate proposals and then segment each proposal to obtain instance masks, such as Mask R-CNN [14], PANet [26], BshapeNet [18] and S4Net [10]. The other is the segmentation mask-based methods, which classify superpixels, such as SegNet [15], YOLACT [4] and the DeepLab series [6], [24]. Since proposal-based networks provide more accurate object localization and classification in most state-of-the-art instance segmentation methods, we choose a proposal-based network as our baseline.

Proposal-based frameworks obtain regions of interest (RoIs) from the primary detection network and then produce a mask prediction for each RoI. However, an RoI may contain multiple objects since overlaps occur frequently, yet only one object is regarded as foreground. The foreground of an RoI is the object that occupies most of the RoI region, and it is the same object used for bounding box regression and classification. Existing proposal-based methods simply treat the other, non-foreground objects in the RoI as background and ignore the interrelationship between objects. As a result, the predicted mask may cover multiple objects when instances of the same category or similar appearance overlap.
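To make this foreground convention concrete, the sketch below shows one common way a proposal-based framework selects the single foreground instance for an RoI, by matching it to the ground-truth box with the highest IoU. This is a minimal illustration of the convention described above, not the authors' implementation; the function name and the IoU-based rule are assumptions.

```python
import numpy as np

def assign_roi_foreground(roi, gt_boxes):
    """Illustrative sketch: pick the ground-truth box with the highest
    IoU against the RoI as its foreground instance (an assumption about
    the usual proposal-based convention, not the paper's exact code).

    roi:      (x1, y1, x2, y2)
    gt_boxes: float array of shape (N, 4)
    Returns the index of the foreground instance, or -1 if none overlaps.
    """
    roi = np.asarray(roi, dtype=np.float64)
    gt = np.asarray(gt_boxes, dtype=np.float64).reshape(-1, 4)
    if gt.shape[0] == 0:
        return -1

    # Intersection rectangle between the RoI and every ground-truth box.
    ix1 = np.maximum(roi[0], gt[:, 0])
    iy1 = np.maximum(roi[1], gt[:, 1])
    ix2 = np.minimum(roi[2], gt[:, 2])
    iy2 = np.minimum(roi[3], gt[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)

    roi_area = (roi[2] - roi[0]) * (roi[3] - roi[1])
    gt_area = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (roi_area + gt_area - inter)

    return int(np.argmax(iou)) if iou.max() > 0 else -1
```

Every other overlapping instance in the RoI is then treated as background, which is exactly the behavior the part prior is meant to mitigate.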

To address this problem in the proposal-based framework, we adopt more detailed information about pedestrian components, inspired by [17], [31], [38]. Among previous part-based pedestrian detection methods, [32] divides pedestrians into top, middle, and bottom regions, which cannot effectively deal with left or right occlusion. [30], [31], [37] divide pedestrians into multiple equal-size parts, without considering the differences between them and without further processing according to the features of each part. These equal-size partitioning methods sacrifice detailed information about pedestrian structure in exchange for network stability. In this paper, by contrast, we divide the parts according to their body proportions to enhance the network’s detailed expression. The stable proportions of the head and most of the body adapt to pedestrians’ different postures, ensuring the network’s robustness. Our method balances detailed expression and robustness, which better helps the network identify the body’s relative position within the visible part under occlusion and improves pedestrian segmentation precision. Besides, to obtain a more accurate segmentation mask, we train the three branches simultaneously; they share the loss and guide the network to pay more attention to the edge details of pedestrian parts.
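As a minimal sketch of what such a proportion-based division could look like, the snippet below splits a full-body pedestrian box vertically into head, torso, and legs. The three-part split and the ratio values are illustrative placeholders, not the exact template used in the paper.

```python
def part_regions(box, head_ratio=0.2, torso_ratio=0.4):
    """Split a full-body box (x1, y1, x2, y2) into vertical parts by
    body proportion. The ratios are hypothetical placeholders, not the
    paper's template; unlike equal-size splits, each part's extent
    follows its anatomical proportion."""
    x1, y1, x2, y2 = box
    h = y2 - y1
    head_y = y1 + head_ratio * h        # bottom of the head region
    torso_y = head_y + torso_ratio * h  # bottom of the torso region
    return {
        "head":  (x1, y1, x2, head_y),
        "torso": (x1, head_y, x2, torso_y),
        "legs":  (x1, torso_y, x2, y2),
    }
```

Intersecting each part region with the visible box can then indicate which parts are visible, which is how part labels may be derived without manual annotation.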

In our experiments, this method is implemented on top of Mask R-CNN [14]; the final segmentation results are shown in Table 3. The results show that our network performs better at the edges of the segmentation mask. Experimental verification was carried out only on the Cityscapes dataset, since part annotation generation requires both the full-body ground truth and the visible box. For datasets such as KITTI [12] and COCO [23] that provide pixel-level instance segmentation annotations, the method in [35] can be used to generate visible-box annotations. However, since this approach has not been applied to other datasets in existing research, we report only the results verified on Cityscapes. The evaluation results on the Cityscapes dataset demonstrate that the proposed method improves single-class pedestrian instance segmentation accuracy.
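For datasets that ship only pixel-level masks, a visible box can be recovered as the tight bounding box of the mask’s visible pixels. The sketch below shows this generic construction; it is an assumption about how such annotations could be derived, not a reproduction of the procedure in [35].

```python
import numpy as np

def visible_box_from_mask(mask):
    """Tight bounding box (x1, y1, x2, y2) of the visible pixels in a
    boolean instance mask of shape (H, W); returns None for an empty
    mask. A generic construction, not the exact method of [35]."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
```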

We make the following three contributions in this work:

  • i

    A simple, easily interpreted semantic part branch for pedestrian instance segmentation under complex driving scenes. Although the part labeling is not fine-grained, the network can learn the parts’ detailed information through the parallel training of the part branch and the other branches.

  • ii

    A method for generating pedestrian part annotations. Accurate semantic part annotations can be obtained with a reasonable proportion template, without additional overhead.

  • iii

    An approach to improving pedestrian instance segmentation accuracy under complex driving scenes. Experimental results on Cityscapes demonstrate the effectiveness of the proposed method.

The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 introduces the proposed framework, followed by the details of the semantic part annotation generation method. Section 4 presents experimental details and result analysis, and Section 5 concludes the paper.


Instance segmentation

The development of deep convolutional neural networks (CNNs) [19] laid the foundation for image segmentation algorithms. Farabet et al. proposed a multi-scale deep convolutional network [11] based on CNNs, in which pixels are classified. Although this method achieved certain image segmentation results, it must perform clustering operations on pixels during segmentation to generate corresponding superpixels, which are then classified by a CNN classifier. Each

Part mask R-CNN

The framework of our method is shown in Fig. 2. The semantic part prior is introduced to improve the performance of pedestrian instance segmentation. The part branch is a binary classification network that takes each RoI as input and predicts which parts are visible. The part annotation generation process is shown in Fig. 3(a) and (b). Fig. 3(a) is the original annotation of the CityPersons dataset, where the green box represents the ground truth and the yellow box is the visible box. Fig. 3(b) is the
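As an illustration of what such a branch could look like, the PyTorch-style sketch below predicts one binary visibility logit per semantic part from pooled RoI features. The channel sizes, RoI resolution, and number of parts are assumptions, not the paper’s configuration.

```python
import torch
import torch.nn as nn

class PartVisibilityBranch(nn.Module):
    """Hypothetical sketch of a per-RoI semantic part branch: pooled RoI
    features in, one binary visibility logit per part out. Layer sizes
    are illustrative assumptions, not the paper's configuration."""

    def __init__(self, in_channels=256, num_parts=3, roi_size=14):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(256 * roi_size * roi_size, num_parts)

    def forward(self, roi_feats):            # (N, C, roi_size, roi_size)
        x = self.convs(roi_feats)
        x = torch.flatten(x, start_dim=1)
        return self.fc(x)                    # (N, num_parts) logits

# During joint training, this branch's loss (e.g. BCEWithLogitsLoss on
# the generated part labels) would simply be added to the detection and
# mask losses so that all branches are optimized together.
```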

Experiments

We evaluate our method on the Cityscapes dataset to verify its effectiveness. Following the standard Cityscapes setting, we use the 2975-image train split for training, the 500-image validation split for validation, and the 1525-image test-dev split for testing. We adopt the Cityscapes evaluation metric AP (averaged over IoU thresholds) to report the results, including AP@0.5, AP@0.75, and APs, APm, APl (AP at different scales). AP@0.5 (or AP@0.75) means using an IoU threshold of 0.5 (or 0.75) to identify whether a predicted bounding box or
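For reference, the IoU-threshold criterion behind AP@0.5 and AP@0.75 can be sketched as follows for segmentation masks; this is a generic illustration of the matching rule, not the Cityscapes evaluation code.

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, thresh=0.5):
    """A prediction counts as a hit at AP@0.5 (thresh=0.5) or
    AP@0.75 (thresh=0.75) when its IoU reaches the threshold."""
    return mask_iou(pred, gt) >= thresh
```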

Conclusions

In this paper, we present Part Mask R-CNN for pedestrian instance segmentation. The semantic part branch strengthens the network’s attention to the details of pedestrian parts. Therefore, in the case of occlusion, this branch can effectively improve the performance of pedestrian instance segmentation. Moreover, it is easy to implement. On the Cityscapes benchmark, extensive results show that Part Mask R-CNN consistently outperforms Mask R-CNN. It can also be applied to other instance

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key R&D Plan (No.2016YFB0100901), the Beijing Municipal Science & Technology Project (No.Z191100007419001) and the National Natural Science Foundation of China (No.61773231).

References (38)

  • M. Cordts et al., The Cityscapes dataset for semantic urban scene understanding, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • J. Dai et al., Deformable convolutional networks, Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • P. Dollar et al., Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell., 2011.
  • R. Fan et al., S4Net: single stage salient-instance segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • C. Farabet et al., Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach. Intell., 2012.
  • A. Geiger et al., Are we ready for autonomous driving? The KITTI vision benchmark suite, IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • A.W. Harley et al., Segmentation-aware convolutional networks using local attention masks, Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • K. He et al., Mask R-CNN, Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • J. Hu et al., Squeeze-and-excitation networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

Handled by Associate Editor Jie Zou, Ph.D.
