
1 Introduction

Recently, instance semantic segmentation (i.e., simultaneous detection and segmentation, SDS [6]) has become an attractive object recognition goal. It combines elements of object detection and semantic segmentation, and is therefore much more challenging: it requires precise detection and correct segmentation of every object in an image. SDS has wide applicability in fields such as autonomous driving, surveillance, visual question answering, and robot obstacle avoidance, to name a few.

Instance semantic segmentation seeks to classify the semantic category of each pixel and to associate each pixel with a physical instance. Related approaches fall roughly into three streams. The first is based on segment proposals. For example, Hariharan et al. [6] carried out pioneering work: starting from category-independent bottom-up object proposals generated by MCG [10], they extracted features from both the bounding box of each region and its foreground using convolutional neural networks (CNNs), and finally classified the concatenated features with an SVM. The second is built upon Markov random field (MRF) models. In [13], the authors used CNNs in conjunction with a global densely connected MRF to perform local object disambiguation and derive globally consistent labels for the entire image. The last group of methods [11, 12] uses recurrent neural networks (RNNs) for dense prediction; [11] presented an end-to-end RNN architecture with a visual attention mechanism to perform instance segmentation. To date, the segment-proposal based approaches have dominated the field of SDS.

However, these segment-proposal based methods have three drawbacks. First, the RoI (region of interest) pooling layer loses detailed spatial structure, because features are pooled and resized to produce the fixed-size representation required by fully-connected (fc) layers [3]. Owing to this fixed-size representation, an object may be mis-predicted, or pixels belonging to the same object may receive inconsistent labels. Second, the size of the object proposals greatly influences segmentation performance. Proposal-based approaches assume that each instance is almost entirely contained in some proposal, so that all that remains is to segment it out; as a result, small instances may never be covered by any proposal. Third, the proposals contain considerable noise from other instances: a proposal covers not only the instance of interest but also other objects we are not interested in.

In this paper, we present a novel segmentation framework, called the Scale-aware Patch Fusion Network (SPF), as shown in Fig. 1. Studies of mid-level representations demonstrate that local patches help extract more structural information and model instance variation. Motivated by the spirit of mid-level representation and multi-scale orderless pooling [4], the proposed SPF accepts patches at multiple scales as inputs, followed by a flexible patch fusion algorithm. Our system regards different patches as different semantic parts of the entire instance. Experiments on VOC 2012 segmentation val, VOC 2012 SDS val, and MS COCO demonstrate excellent performance using an end-to-end trained VGG-16 model (on a Titan X GPU).

In addition, our framework for solving the SDS problem makes the following main contributions.

  1. We propose a strategy to generate multi-scale patches for instance parsing.

  2. We develop an efficient algorithm to infer the segmentation mask for each instance by merging information from mid-level patches.

  3. We capture more detailed and discriminative information via the different patches.

Fig. 1.

The SPF framework for instance segmentation. S denotes the small-scale patch, M the middle-scale patch, and L the large-scale patch. We segment the patches at the different scales and then merge the segmentation results via a new fusion algorithm. For clarity, only the intermediate results for the S scale are shown.

2 The Proposed Method

Figure 1 shows the overall architecture of the deep SPF network. The key components of the framework are the multi-scale patch generator, the semantic segmentation network, and the patch fusion algorithm. First, patches are generated at multiple scales; the local patches are then segmented and classified by a multi-task segmentation network; finally, the predicted results are fused by the new patch fusion algorithm.

2.1 Multi-scale Patch Generator

In this section, we describe how to generate patches at multiple scales from the original image. We use three scales, i.e., 64\(\,\times \,\)64, 96\(\,\times \,\)96, and 256\(\,\times \,\)256. We first normalize all images to a common size of 256\(\,\times \,\)256 and then generate the three scales of patches for each normalized image. The coarsest scale is the whole image, which preserves global spatial information. For the other two scales, we extract 64\(\,\times \,\)64 and 96\(\,\times \,\)96 patches to capture more local and fine-grained information. To filter out redundant patches, we reselect from the overlapping patches according to the following two constraints:

  1. The patch center overlaps with the instance center.

  2. The area of \(P_i\) is at least 1.5 times that of \(I_i\).

Here \(P_i\) denotes the i-th patch and \(I_i\) the i-th instance in the patch. We choose the patches that satisfy the above two constraints and store them in \({\varGamma _1}\). Each patch is represented by a four-tuple (r, c, h, w), where h and w are its height and width, and r and c are the coordinates of its top-left corner.
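To make the selection procedure concrete, here is a minimal Python sketch of the patch generator under stated assumptions: the helper name `select_patches` is ours, constraint (1) is relaxed to a tolerance of one sliding stride, and constraint (2) is read as the patch area being at least 1.5 times the instance area.

```python
def select_patches(instance_boxes, patch_size, image_size=256, stride=16):
    """Slide fixed-size windows over the normalized 256x256 image and keep
    patches satisfying the two constraints of Sect. 2.1: the patch center
    coincides with an instance center (up to the stride), and the patch
    area is at least 1.5x the instance area (our reading of constraint 2)."""
    gamma_1 = []  # selected patches as four-tuples (r, c, h, w)
    for r in range(0, image_size - patch_size + 1, stride):
        for c in range(0, image_size - patch_size + 1, stride):
            pr, pc = r + patch_size / 2.0, c + patch_size / 2.0  # patch center
            for (ir, ic, ih, iw) in instance_boxes:  # instance boxes (r, c, h, w)
                icr, icc = ir + ih / 2.0, ic + iw / 2.0  # instance center
                center_hit = abs(pr - icr) < stride and abs(pc - icc) < stride
                area_ok = patch_size * patch_size >= 1.5 * ih * iw
                if center_hit and area_ok:
                    gamma_1.append((r, c, patch_size, patch_size))
                    break
    return gamma_1

# Hypothetical usage: one 40x60 instance centered at (128, 128).
print(select_patches([(108, 98, 40, 60)], patch_size=96))  # [(80, 80, 96, 96)]
```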

2.2 Semantic Segmentation Network

After reselecting patches, a cascade segmentation network is employed to get the corresponding patch segmentation results. Our cascade segmentation model contains three stages: differentiating instances, regressing mask-level instances, and categorizing instances.

Differentiating Instances. In the first stage, we use a Region Proposal Network (RPN), which has two sibling 1\(\,\times \,\)1 convolutional layers for box regression and object classification. The loss function of this stage is defined as follows:

$$\begin{aligned} {L_{RPN}}(\varTheta ) = {L_{cls}}(B(\varTheta )) + \lambda {L_{reg}}(B(\varTheta )) \end{aligned}$$
(1)

Here \(\varTheta \) and B denote the network parameters and the outputs of the first stage, respectively. B is the list of boxes: \(B = \{ {B_i}\} \) with \({B_i} = \{ {x_i},{y_i},{w_i},{h_i},{p_i}\} \), where \({B_i}\) is the box indexed by i, (\({x_i}, {y_i}\)) is the coordinate of its center, \({w_i}\) and \({h_i}\) are its width and height, and \({p_i}\) is the predicted objectness probability. In this paper, the balance weight \(\lambda \) is set to 1. Eq. (1) indicates that the loss of stage 1 is a function of the network parameters \(\varTheta \).
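For illustration, a minimal sketch of how the stage-1 loss of Eq. (1) could be assembled is shown below; using smooth-L1 for the regression term follows common RPN practice and is our assumption, as the paper does not specify the exact form of \(L_{reg}\).

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, cls_labels, box_pred, box_targets, lam=1.0):
    """Stage-1 loss of Eq. (1): objectness classification plus box
    regression, balanced by lambda (1 in the paper).
    cls_logits: (N, 2) anchor scores; cls_labels: (N,) in {0, 1};
    box_pred, box_targets: (N, 4) box parameters."""
    l_cls = F.cross_entropy(cls_logits, cls_labels)
    pos = cls_labels == 1  # regression is applied only to object anchors
    l_reg = (F.smooth_l1_loss(box_pred[pos], box_targets[pos])
             if pos.any() else torch.tensor(0.0))
    return l_cls + lam * l_reg
```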

Regressing Mask-Level Instances. The second stage takes the shared convolutional features and the regressed bounding boxes as inputs, and outputs a pixel-level mask for each RoI. To produce a fixed-size representation while remaining differentiable with respect to the box coordinates, we perform RoI pooling with a new strategy: we first use an RoI warping layer to crop a feature map region and warp it to a fixed size (14\(\,\times \,\)14) by bilinear interpolation, and then apply standard max pooling. The RoI warping operation can be described as:

$$\begin{aligned} {F_i}^{RoI}(\varTheta ) = G({B_i}(\varTheta ))F(\varTheta ) \end{aligned}$$
(2)

Here \(F(\varTheta )\) represents the full-image feature map, reshaped as an m-d vector (\(m = WH\)), where \(W \times H\) is the spatial resolution of the full-image feature map. G denotes the cropping and warping operations; it is an \(m' \times m\) matrix with \(m' = W'H'\), where \(W' \times H'\) is the pre-defined warping output resolution. \({F_i}^{RoI}(\varTheta )\) is the resulting \(m'\)-d vector. The computation in Eq. (2) has the following representation:

$$\begin{aligned} \begin{aligned} {F_i}^{RoI}(u',v'|\varTheta )&= \sum \limits _{(u,v)}^{W \times H} {G(u,v;u',v'|{B_i}(\varTheta )){F_{(u,v)}}(\varTheta )} \\&= \sum \limits _{(u,v)}^{W \times H} {R(u,u'|{x_i},{w_i})R(v,{v'}|{y_i},{h_i}){F_{(u,v)}}(\varTheta )} \\ R(u,u'|{x_i},{w_i})&= \max (0,1 - |{x_i} + \frac{{u'}}{{W'}}{w_i} - u|) \\ R(v,v'|{y_i},{h_i})&= \max (0,1 - |{y_i} + \frac{{v'}}{{H'}}{h_i} - v|) \\ \end{aligned} \end{aligned}$$
(3)

Here \((u',v')\) is a pixel coordinate in the target \(W' \times H'\) feature map, and \((u,v)\) is defined similarly on the source map. The function G transforms the bounding box from \(({x_i} - {w_i}/2,{x_i} + {w_i}/2) \times ({y_i} - {h_i}/2,{y_i} + {h_i}/2)\) into \((- W'/2,W'/2) \times (- H'/2,H'/2)\), while R is the bilinear interpolation function.
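The following NumPy sketch illustrates Eqs. (2)-(3) on a single 2-D feature map; the grid-alignment convention (how \(u'\) is shifted into the \((-W'/2, W'/2)\) range) is an assumption, and the loop-based form is chosen for readability rather than speed.

```python
import numpy as np

def roi_warp(feature, box, out_h=14, out_w=14):
    """Bilinear RoI warping of Eqs. (2)-(3): sample an out_h x out_w grid
    from the region `box` of a 2-D feature map. `box` is (x, y, w, h) with
    (x, y) the box center, matching the parameterization of B_i."""
    H, W = feature.shape
    x, y, w, h = box
    out = np.zeros((out_h, out_w))
    for v_p in range(out_h):                      # v' shifted into 0..H'-1
        for u_p in range(out_w):                  # u' likewise
            src_u = x + (u_p / out_w - 0.5) * w   # x_i + (u'/W') w_i
            src_v = y + (v_p / out_h - 0.5) * h   # y_i + (v'/H') h_i
            u0, v0 = int(np.floor(src_u)), int(np.floor(src_v))
            for v in (v0, v0 + 1):                # the kernel R is nonzero
                for u in (u0, u0 + 1):            # only at 4 nearest pixels
                    if 0 <= u < W and 0 <= v < H:
                        wgt = (max(0.0, 1 - abs(src_u - u))
                               * max(0.0, 1 - abs(src_v - v)))
                        out[v_p, u_p] += wgt * feature[v, u]
    return out

# Hypothetical usage: warp a 10x10 box centered at (16, 16) of a 32x32 map.
print(roi_warp(np.random.rand(32, 32), (16.0, 16.0, 10.0, 10.0)).shape)  # (14, 14)
```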

After this special RoI pooling, we use two fc layers to reduce the dimension and regress the mask. The first fc layer reduces the feature dimension to 256. The second fc layer has a 784-way output that regresses a pixel-level mask \(M_i\) with a spatial resolution of \(n \times n\) (we use \(n = 28\)); the mask is thus parameterized by an \({n^2}\)-dimensional vector. The loss function of stage 2 is formally written as:

$$\begin{aligned} {L_{mask}}(\varTheta ) = {L_{mask}}(M(\varTheta )|B(\varTheta )) \end{aligned}$$
(4)

As a related method, DeepMask also regresses discretized masks. DeepMask applies the regression layers to dense sliding windows (fully-convolutionally), but our method only regresses masks from a few proposed boxes and so reduces computational cost. Moreover, mask regression is only one stage in our network cascade that shares features among multiple stages, so the marginal cost of the mask regression layers is very small.
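As a rough sketch of the stage-2 head described above (the 512 input channels of VGG-16 conv5, the pooled 7\(\,\times \,\)7 size, and the sigmoid output are assumptions not stated explicitly in the text):

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Stage-2 mask head sketch: 2x2 max pooling after the 14x14 RoI warp,
    an fc layer down to 256, and a 784-way fc layer producing a 28x28
    per-RoI mask M_i, as described in the text."""
    def __init__(self, in_channels=512, n=28):
        super().__init__()
        self.pool = nn.MaxPool2d(2)                     # 14x14 -> 7x7
        self.fc1 = nn.Linear(in_channels * 7 * 7, 256)  # dimension reduction
        self.fc2 = nn.Linear(256, n * n)                # mask regression
        self.n = n

    def forward(self, roi_feats):                       # (N, C, 14, 14)
        x = self.pool(roi_feats).flatten(1)
        m = torch.sigmoid(self.fc2(torch.relu(self.fc1(x))))
        return m.view(-1, self.n, self.n)               # (N, 28, 28)

# Hypothetical usage on 4 warped RoIs:
print(MaskHead()(torch.randn(4, 512, 14, 14)).shape)    # (4, 28, 28)
```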

Categorizing Instances. Given the binary mask \(M_i\) and the warped feature maps \(F_i^{RoI}\), we compute the masked feature maps \(F_i\) by the following element-wise product:

$$\begin{aligned} {F_i} = F_i^{RoI}(\varTheta ) \bullet {M_i}(\varTheta ) \end{aligned}$$
(5)

The classification loss term \({L_{classify}}\) of this stage is described as:

$$\begin{aligned} {L_{classify}}(\varTheta ) = {L_{classify}}(C(\varTheta )|B(\varTheta ),M(\varTheta )) \end{aligned}$$
(6)

where C is the output of stage-3, representing category prediction list: \(C = \{{C_i}\}\).
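A minimal sketch of the masking step of Eq. (5), assuming the predicted mask has been resized to the warped feature resolution:

```python
import torch

def masked_features(roi_feats, masks):
    """Eq. (5): suppress background activations inside each RoI by an
    element-wise product of the warped RoI features with the predicted
    mask, before stage-3 classification.
    roi_feats: (N, C, H, W); masks: (N, H, W) with values in [0, 1]."""
    return roi_feats * masks.unsqueeze(1)  # broadcast the mask over channels

# Hypothetical usage:
feats, masks = torch.randn(4, 512, 14, 14), torch.rand(4, 14, 14)
print(masked_features(feats, masks).shape)  # (4, 512, 14, 14)
```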

Then the loss function of the whole network is defined as follows:

$$\begin{aligned} L(\varTheta ) = {L_{RPN}}(\varTheta ) + {L_{mask}}(M(\varTheta )|B(\varTheta )) + {L_{classify}}(C(\varTheta )|B(\varTheta ),M(\varTheta )) \end{aligned}$$
(7)

where balance weights of 1 are implicitly used among the three terms. L is minimized w.r.t. the network parameters.

This loss function differs from traditional multi-task learning, because the loss term of a later stage depends on the outputs of the earlier ones. By the chain rule of backpropagation, the gradient of \(L_{mask}\) involves the gradients w.r.t. B. The main technical challenge in applying the chain rule to Eq. (7) lies in the spatial transform of a predicted box \(B_i\), which determines the RoI pooling; this is solved by the RoI warping layer. We finally train the model for the whole objective function with Stochastic Gradient Descent (SGD).

2.3 Patch Fusion Algorithm

After network segmentation and classification, we obtain a predicted label \({y_i}\) and a semantic mask for each patch. The final result is obtained by fusing the semantic masks of nearby patches. To reduce accumulated error, we fuse the patches with a pyramid aggregation scheme.

For each patch \({P_m}\), we compute the overlap scores of its semantic mask against those of neighboring patches with the same predicted label. We denote by \({s_{mn}}\) the overlap score of \({P_m}\) and \({P_n}\), defined as their intersection-over-union (IoU). We search for patches along two directions: the row search range contains the patches located to the left of \({P_m}\), denoted \({C_l}({P_m})\), and the column search range contains the patches located above \({P_m}\), denoted \({C_t}({P_m})\). All patches in the row and column directions are iterated over. The overlap scores \({s_{mn}}\), together with the corresponding patch pairs \(({P_m}, {P_n})\), are stored in \({\varGamma _2}\); we repeatedly merge the patch pair with the highest overlap score until no pair has a score above the threshold \(\tau \).
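A simplified greedy sketch of this fusion procedure is given below; the row/column search ranges \(C_l\) and \(C_t\) are omitted for brevity, and merging a pair by taking the union of their masks is our assumption.

```python
import numpy as np

def mask_iou(a, b):
    """IoU of two boolean masks in full-image coordinates."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def fuse_patches(masks, labels, tau=0.6):
    """Greedy fusion of Sect. 2.3: repeatedly merge the same-label patch
    pair with the highest overlap score until no pair exceeds tau.
    masks: list of boolean arrays; labels: predicted class per patch."""
    masks, labels = list(masks), list(labels)
    while True:
        best = None  # best (score, m, n) candidate from Gamma_2
        for m in range(len(masks)):
            for n in range(m + 1, len(masks)):
                if labels[m] != labels[n]:
                    continue
                s = mask_iou(masks[m], masks[n])
                if s > tau and (best is None or s > best[0]):
                    best = (s, m, n)
        if best is None:
            return masks, labels
        _, m, n = best
        masks[m] = np.logical_or(masks[m], masks[n])  # merge the pair
        del masks[n]; del labels[n]

# Hypothetical usage with two overlapping same-label masks (IoU ~ 0.53):
a = np.zeros((8, 8), bool); a[:6, :6] = True
b = np.zeros((8, 8), bool); b[1:7, 1:7] = True
print(len(fuse_patches([a, b], ["person", "person"], tau=0.5)[0]))  # 1 (merged)
```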

3 Experiments

3.1 Implementation Details

Positive/Negative Samples. In the second stage, we compute the highest overlap score of each regressed box with respect to the ground-truth masks. An RoI is considered positive and contributes to the mask branch if its box IoU is larger than 0.5, and negative otherwise. In stage 3, we adopt two sets of positive/negative samples. For the first set, if the box-level IoU between a box and the nearest ground-truth box is \({\geqslant }\ 0.5\), the RoI is regarded as a positive sample (the rest are negative). For the second set, the positive samples are the objects that overlap ground-truth objects by box-level IoU \({\geqslant }\ 0.5\) and mask-level IoU \({\geqslant }\ 0.5\). The classification loss is defined only on positive RoIs. Stages 4 and 5 share definitions similar to stages 2 and 3.
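As an illustration, a sketch of the box-level sampling rule (the helper names are ours):

```python
def box_iou(b1, b2):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = min(b1[2], b2[2]) - max(b1[0], b2[0])
    ih = min(b1[3], b2[3]) - max(b1[1], b2[1])
    inter = max(0, iw) * max(0, ih)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / float(a1 + a2 - inter) if inter else 0.0

def is_positive(roi_box, gt_boxes, thresh=0.5):
    """Box-level sampling rule: an RoI is positive iff its best IoU
    against any ground-truth box is >= 0.5."""
    return max((box_iou(roi_box, g) for g in gt_boxes), default=0.0) >= thresh

# Hypothetical usage:
print(is_positive((0, 0, 10, 10), [(1, 1, 11, 11)]))  # True (IoU ~ 0.68)
```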

Inference. For each scale of the original image, the RPN generates \({\sim }{10^4}\) RoIs in the first stage. Non-maximum suppression (NMS) with an IoU ratio of 0.7 is used to filter out redundant regressed boxes before stage 2. The binary mask branch and the classification branch are then applied to the top-ranked 300 RoIs: for each box we obtain a binary mask (in probability) and classification scores, and each RoI is assigned the category with the highest classification score.
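A sketch of the greedy NMS step used here, reusing the `box_iou` helper from the sampling sketch above:

```python
def nms(boxes, scores, iou_thresh=0.7, top_k=300):
    """Greedy NMS: process boxes by descending objectness score, drop any
    box overlapping an already-kept box by IoU > iou_thresh, and keep at
    most the top_k survivors (300 at inference in the paper)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
            if len(keep) == top_k:
                break
    return keep
```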

Training. We train and fine-tune our system on the Caffe platform. We initialize the shared convolutional layers with the released pre-trained VGG-16 model and randomly initialize the extra layers. The five stages are trained end to end with shared convolutional features; each later stage takes the outputs of the earlier stages as inputs. The initial learning rate is set to 0.001 for 32k iterations and decreased by a factor of 10 for the following 8k iterations. We use SGD on a single Titan X GPU, with a weight decay of 0.0005 and a momentum of 0.9. We use the image patches generated as in Sect. 2.1, namely 64\(\,\times \,\)64, 96\(\,\times \,\)96, and 256\(\,\times \,\)256.

3.2 Experiments on PASCAL VOC 2012

We first conduct experiments on PASCAL VOC 2012. Following the protocols widely used in [1, 2, 3, 9], the models are trained on the VOC 2012 training set and evaluated on the validation set. We also use the extra annotations from [5], giving 10,582 images for training, 1,449 for validation, and 1,456 for testing. All approaches are evaluated by the mask-level IoU between predicted segmentations and the ground truth. We measure the mean Average Precision (mAP\(^r\); the superscript r refers to the segmented region) at IoU thresholds 0.5 and 0.7. Our models are denoted SPF\(_S\), SPF\(_M\), and SPF\(_L\) (inputs at the different scales).

Comparisons with State-of-the-art Approaches. The proposed SPF method is compared with state-of-the-art approaches on VOC 2012 segmentation val, as shown in Table 1. Notably, our method outperforms all the state-of-the-art models, including MNC [3] and FCIS [9], the winners of the COCO 2015 and 2016 segmentation challenges. [3] proposed a novel multi-task, multi-stage cascade network that predicts segment proposals from RoIs, followed by a classification branch, yet it cannot capture detailed spatial structure. Li et al. [9] presented the first fully convolutional system for SDS, but their method introduces artificial edge information. In contrast, our method provides more robust structural information and eliminates artificial edges by taking the scale factor into account. The proposed approach also handles challenging cases where the input image contains multiple scales or the instance category resembles the background (e.g., Fig. 2). Table 1 shows that SPF\(_L\) achieves \({\sim }\)4% higher mAP\(^r\)@0.5 than MNC and 1.7% higher mAP\(^r\)@0.5 than FCIS.

Influence of IoU Threshold. Table 2 explores the impact of the fusion IoU threshold \(\tau \) by gradually increasing it from 0.6 to 0.9. Accuracy improves as the number of scales increases and as the IoU threshold decreases; our method obtains its best result at \(\tau = 0.6\). We therefore use an IoU threshold of 0.6 in all other experiments.

Qualitative Results. Figure 3 shows several sample results on the VOC 2012 validation set. For each image, we show the ground-truth labels, the MNC results, and our predictions. The results suggest that our approach generates high-quality segmentation masks: both large and small instances are segmented out and classified.

Table 1. Comparison of SDS on PASCAL VOC 2012 val set. MNC [3] and FCIS [9] are the winners of the COCO 2015 and COCO 2016 segmentation challenges.
Table 2. Experimental results (in AP\(^r\)) on PASCAL VOC 2012 segmentation val with different IoU threshold \(\tau \).
Fig. 2.

Results produced by MNC and our SPF. Compared to MNC, we generate finer segmentations and handle challenging instances.

Fig. 3.

Example results generated by our SPF network on PASCAL VOC 2012 validation set. For each image, we show the ground-truth label, MNC results and our segmentation results.

Table 3. More results on VOC 2012 SDS val. The best mAP\(^r\) is bold-faced.

3.3 Experiments on VOC 2012 SDS

We perform a thorough comparison of SPF with the leading approaches [2, 6, 7] on this dataset, as shown in Table 3. This subset contains 5,623 training images and 5,732 validation images. For evaluation, we use the mAP\(^r\) metric as in [6], with the pre-trained VGG-16 model. During inference, we apply either one scale or three scales as input to our network.

Fig. 4.

Extra examples on the VOC 2012 SDS val set. Our approach can simultaneously segment out both large and small instances.

Specifically, Hypercolumn [7] refines its masks using an independent detection model for better performance. It is evident that our method achieves a large improvement even without this effective strategy. Table 3 shows that our model with a single-scale input (61.3% mAP\(^r\)) is 0.6 points higher than CFM, and multi-scale inputs improve the results further, with a margin of 5.8 points mAP\(^r\) over [2]. This again indicates that scale is essential for detection and segmentation. Some segmentation results are visualized in Fig. 4.

3.4 Experiments on MS COCO

We finally evaluate our approach on the MS COCO dataset, which covers 80 object categories and a large number of complex images. Our network is trained on the 80k + 40k trainval images, and results are reported on the test-std and test-dev sets. We evaluate performance with three standard metrics: the COCO evaluation metric (mAP\(^r\)@[0.5:0.95]), the PASCAL VOC metric (mAP\(^r\)@0.5), and the stricter metric mAP\(^r\)@0.75.

Comparison with MNC. We compare the SPF approach with MNC [3], the winner of the 2015 COCO segmentation challenge. The two approaches share a similar architecture and similar training/inference procedures; for a fair comparison, the common implementation details are kept the same. Table 4 shows the results with the VGG-16 model. SPF\(_L\) achieves an mAP\(^r\)@[0.5:0.95] of 22.4% on the COCO test-dev subset, 2.9% (absolute) higher than MNC, even without any additional enhancement strategies. The improvement is most significant for small instances, suggesting that the SPF system captures much more detailed spatial structure.

Table 4. Experimental results on Microsoft COCO test-std and test-dev set using the mAP\(^r\) metric at different thresholds using VGG-16 model.

Ablation Study on MS COCO. We finally conduct a number of ablations with ResNet-101 model. The results are presented in Table 5, and analyzed in the following.

SPF baseline: The baseline SPF approach obtains an mAP\(^r\)@[0.5:0.95] of 38.8%, already outperforming FCIS+++ [9] by 1.2%, which strongly confirms the effectiveness of SPF for segmenting instances.

Horizontal flip: Following [3, 9], our SPF system is trained on both the original images and their horizontally flipped counterparts. This augmentation leads to a further improvement of 0.3%, verifying the translation-variant property of SPF.

Ensemble: Similar to [8], we form an ensemble of just two models with different depths. The final performance is 39.5%, an increase of 0.4%. Applying OHEM to this method further improves accuracy by 0.2%.

Table 5. Results of instance semantic segmentation on COCO test-dev set using ResNet-101 model.

4 Conclusion

The presented Scale-aware Patch Fusion network takes multi-scale patches as inputs and segments them via a multi-task network; the patch segmentation results are then merged by the fusion algorithm to produce the final result. This unified, trainable network inherits the merits of mid-level patches and provides much more detailed information. We have evaluated our approach on three datasets for instance semantic segmentation. Although it achieves promising results, further research will be carried out to improve the efficiency of segmentation.