
1 Introduction

Object detection is a computer vision task that identifies and localizes objects in images or videos. There may be one or more objects in an image, and localizing an object that can appear anywhere in the image is difficult. A series of methods has been proposed for this task. The DNN-based methods can generally be divided into two groups: (1) one-stage methods and (2) two-stage methods.

One-stage methods, including YOLO [9] and SSD [10], are more efficient than two-stage methods and are applied in real-time object detection. One-stage methods generate the classification and box information directly by regression, without a region proposal network. Two-stage methods are based on the RCNN [4] architecture and have been improved into many other methods, such as Fast RCNN, Faster RCNN, RFCN, and Mask RCNN [1, 2, 3, 5]. In two-stage methods, deep convolutional neural networks pretrained on ImageNet are used to extract feature maps and are then fine-tuned through backpropagation. Two-stage methods can be divided into two subnetworks: a region proposal network and a prediction network. In general, two-stage methods perform better than one-stage methods in object detection. Accuracy and speed are a pair of contradictions, and how to better balance them has been an important direction in object detection research.

In current research, some significant methods have been proposed and perform well, such as [3, 8]. FPN [8] develops a top-down architecture with lateral connections to build high-level semantic feature maps at all scales. The main idea is to build a feature pyramid by exploiting the multi-scale, pyramidal hierarchy inherent in deep convolutional networks. As a general feature extractor, FPN addresses the multi-scale problem explicitly, with strong expressive ability and robustness across scales. Mask RCNN [3] proposes an additional branch for the object mask task and has become a baseline for object detection.

The region proposal network (RPN) is a significant subnetwork that outputs many candidate boxes for objects. The boxes and features are input into RoI pooling, where the features inside each box are extracted and reshaped to a preset scale by a special pooling layer with a variable filter shape. Many improvements have been proposed, such as Position-Sensitive RoI pooling (PSRoI pooling) [5] and RoIAlign [3]. PSRoI pooling, proposed in RFCN, encodes position information with respect to relative spatial positions. RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin. In this paper, a modified PSRoI pooling is proposed: each bin in PSRoI pooling is scaled to 2\(\times \) so that it includes more spatial information around the area. We test our proposed network with a ResNet101 backbone and obtain improvements on several datasets.
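As a brief illustration of the bilinear interpolation at the heart of RoIAlign, the following minimal NumPy sketch samples a single-channel feature map at a non-integer location (the function name and the simplified layout are our own, not taken from the RoIAlign implementation):

import numpy as np

def bilinear_sample(feat, x, y):
    # feat: 2-D feature map of shape (H, W); (x, y): continuous sampling location.
    H, W = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    dx, dy = x - x0, y - y0
    # weighted average of the four surrounding integer grid points
    return ((1 - dx) * (1 - dy) * feat[y0, x0] + dx * (1 - dy) * feat[y0, x1]
            + (1 - dx) * dy * feat[y1, x0] + dx * dy * feat[y1, x1])

RoIAlign evaluates this at four regularly sampled locations in each bin and aggregates the results, avoiding the coordinate quantization of standard RoI pooling.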

2 Related Work

Among object detection algorithms based on region proposals, Fast R-CNN [2] adds an RoI pooling layer after the last convolutional layer of R-CNN [4], adds box regression to the CNN training process, uses Softmax instead of SVM for classification, and is trained end-to-end. Faster R-CNN [1] uses an RPN instead of Fast R-CNN's Selective Search method [2], allowing the RPN and the Fast R-CNN network to share the feature extraction network. Mask R-CNN [3] adds an FCN branch to generate the corresponding mask on top of the original Faster R-CNN algorithm. The algorithm can be used to accomplish various tasks such as object classification, object detection, semantic segmentation, instance segmentation, and human pose recognition.

We know that classification requires features with translation invariance, while object detection requires accurate responses to the translation of objects. Most CNNs do a good job in classification but are less effective in detection. For this problem, methods such as Faster R-CNN [1] are convolutional, and thus translation-invariant, before RoI pooling, but once RoI pooling is inserted, the network structure that follows is no longer translation-invariant. The position-sensitive score maps proposed in RFCN [5] integrate object position information into RoI pooling, which solves this problem well. CoupleNet [6] introduces the global and context information of the proposal on top of the original RFCN [5] and improves detection accuracy by combining part, global, and context information. Because feature maps at different layers have different advantages, a multi-scale convolutional neural network has been proposed in which detectors of different scales are designed for different layers of the feature map [7]. Based on the RFCN network, this paper builds a modified position-sensitive pooling that incorporates spatial information and achieves improvements on several datasets.

3 Our Approach

3.1 PSRoI Pooling

Since our work is based on PSRoI pooling [5], we first introduce it. In RFCN, images are input into the backbone subnetwork and feature maps are extracted. The feature maps then generate score maps and location information. In PSRoI pooling, the score maps and candidate boxes are the inputs. The score maps have K \(\times \) K \(\times \) (C+1) channels. We extract the sub-area of the score maps according to the candidate boxes. The RoI is then divided into K \(\times \) K bins. Following RFCN, the operation of position-sensitive RoI pooling [5] in the (i, j)-th bin \((0\le {i, j}\le {k-1})\) is:

$$\begin{aligned} r_c(i,j|\varTheta )=\sum _{(x,y)\in {bin(i,j)}}{z_{i,j,c}(x+x_0,y+y_0|\varTheta )/n} \end{aligned}$$
(1)

Here \(r_c(i,j)\) is the pooled response in the (i, j)-th bin for the c-th category, \(z_{i,j,c}\) is one of the \(k^2(C+1)\) score maps, \((x_0, y_0)\) denotes the top-left corner of the RoI, n is the number of pixels in the bin, and \(\varTheta \) denotes all learnable parameters of the network. When we compute the score of one bin, such as the top-left bin, we first find the corresponding C + 1 score maps. We extract the sub-area of the bin and compute the average score as the output. We then vote over the bins to get the final score of the whole RoI for every class.
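To make Eq. (1) concrete, here is a minimal NumPy sketch of the pooling and voting steps. The channel ordering of the score maps is an assumption for illustration; real implementations such as RFCN's CUDA kernel differ in detail:

import numpy as np

def psroi_pool(score_maps, roi, K, C):
    # score_maps: (K*K*(C+1), H, W), the k^2(C+1) position-sensitive score maps.
    # roi: (x0, y0, x1, y1) in pixel coordinates on the score maps.
    x0, y0, x1, y1 = roi
    bin_w = (x1 - x0) / K
    bin_h = (y1 - y0) / K
    pooled = np.zeros((K, K, C + 1))
    for i in range(K):
        for j in range(K):
            # pixel extent of bin (i, j)
            xs = int(np.floor(x0 + j * bin_w))
            xe = max(xs + 1, int(np.ceil(x0 + (j + 1) * bin_w)))
            ys = int(np.floor(y0 + i * bin_h))
            ye = max(ys + 1, int(np.ceil(y0 + (i + 1) * bin_h)))
            for c in range(C + 1):
                ch = (i * K + j) * (C + 1) + c  # assumed channel layout
                # average over the n pixels of the bin, as in Eq. (1)
                pooled[i, j, c] = score_maps[ch, ys:ye, xs:xe].mean()
    # vote: average the K*K bin scores to get one score per class for the RoI
    return pooled.reshape(K * K, C + 1).mean(axis=0)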

3.2 Our Modified Model

Referring to [6, 7], using a larger RoI that includes spatial information can help to improve the performance of the network. In PSRoI pooling, the score of each bin is computed only over the pixels inside the bin. When computing the score in each bin, we rescale the size of the bin to 2\(\times \). Figure 1 is an example of our model. For the central bin, the original PSRoI pooling computes the score for the C + 1 classes over the corresponding bin of the RoI, while our modified PSRoI pooling computes the score for the C + 1 classes over the 2\(\times \) bin. For example, let \((x_0,y_0,x_1,y_1)\) be the coordinates of one bin in the RoI, so the width of the bin is \(x_1-x_0\) and the height of the bin is \(y_1-y_0\). In our modified network, the coordinates are set to \((x_0-(x_1-x_0)/2,\,y_0-(y_1-y_0)/2,\,x_1+(x_1-x_0)/2,\,y_1+(y_1-y_0)/2)\). With this modification, some of the area may fall outside the image; we set the score outside the image to 0. We thereby add spatial information around the RoI, which improves the performance of the network. We manually set the weight of the area outside the original bin to 0.5. The score of the (i, j)-th bin \((0\le {i,j}\le {k-1})\) is then modified to:

$$\begin{aligned} r_c(i,j|\varTheta )=\sum _{(x,y)\in {bin^{\star }(i,j)}}{w\times z_{i,j,c}(x+x_0,y+y_0|\varTheta )/2n} \end{aligned}$$
(2)

where \(bin^{\star }(i,j)\) is the 2\(\times \) enlarged bin(i, j), and w is 1 where \((x,y)\in bin(i,j)\) and 0.5 where \((x,y)\not \in bin(i,j)\).
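The following is a minimal NumPy sketch of the modified pooling in Eq. (2), under the same assumed channel layout as the sketch in Sect. 3.1; the per-pixel loop is written for clarity, not speed:

import numpy as np

def modified_psroi_pool(score_maps, roi, K, C, w_out=0.5):
    # score_maps: (K*K*(C+1), H, W); roi: (x0, y0, x1, y1) on the score maps.
    _, H, W = score_maps.shape
    rx0, ry0, rx1, ry1 = roi
    bin_w = (rx1 - rx0) / K
    bin_h = (ry1 - ry0) / K
    pooled = np.zeros((K, K, C + 1))
    for i in range(K):
        for j in range(K):
            # original bin and its 2x enlargement (half a bin on each side)
            bx0, bx1 = rx0 + j * bin_w, rx0 + (j + 1) * bin_w
            by0, by1 = ry0 + i * bin_h, ry0 + (i + 1) * bin_h
            ex0, ex1 = bx0 - bin_w / 2, bx1 + bin_w / 2
            ey0, ey1 = by0 - bin_h / 2, by1 + bin_h / 2
            n = max(1, int(round(bin_w * bin_h)))  # pixels in the original bin
            for c in range(C + 1):
                ch = (i * K + j) * (C + 1) + c  # assumed channel layout
                acc = 0.0
                for y in range(int(np.floor(ey0)), int(np.ceil(ey1))):
                    for x in range(int(np.floor(ex0)), int(np.ceil(ex1))):
                        if not (0 <= x < W and 0 <= y < H):
                            continue  # score outside the image is set to 0
                        inside = (bx0 <= x < bx1) and (by0 <= y < by1)
                        w = 1.0 if inside else w_out  # w = 0.5 outside bin(i, j)
                        acc += w * score_maps[ch, y, x]
                pooled[i, j, c] = acc / (2 * n)  # divide by 2n as in Eq. (2)
    return pooled.reshape(K * K, C + 1).mean(axis=0)  # vote as before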

Fig. 1. The top figure shows the PSRoI pooling operation of the central bin. The bottom figure shows our modified model. The scale of the pooling region is enlarged to 2\(\times \).

4 Experiments

4.1 Experiments on PASCAL VOC

PASCAL VOC has 20 object categories. The VOC 2007 and VOC 2012 sets are widely used in object detection. We first train on the union of the VOC 2007 trainval and VOC 2012 trainval sets, and test on the VOC 2007 test set. We use a ResNet101 backbone to compute the feature maps, and in the PSRoI pooling every RoI is divided into 7 \(\times \) 7 bins (K = 7). The multi-scale training strategy [11] is adopted here: following RFCN [5], for each training iteration we randomly resize the image to 400, 500, 600, 700, or 800 pixels. When testing, we test only on a single scale of 600 pixels. The same strategy is adopted in the following experiments. The learning rate is set to 0.001 for the first 30k iterations and then to 0.0001 for the rest, with a mini-batch size of 8. We compare RFCN [5] and our network; the results are shown in Table 1.
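As a sketch of the multi-scale strategy, the snippet below draws one scale per iteration. We assume the scales refer to the shorter image side, as is conventional for RFCN [5]; cv2 is used here only for illustration:

import random
import cv2

TRAIN_SCALES = (400, 500, 600, 700, 800)  # shorter-side sizes for training
TEST_SCALE = 600                          # single scale at test time

def rescale_shorter_side(image, target):
    # Resize so the shorter side equals `target`, preserving aspect ratio.
    h, w = image.shape[:2]
    s = target / min(h, w)
    return cv2.resize(image, (int(round(w * s)), int(round(h * s))))

def training_rescale(image):
    # One scale is drawn at random for each training iteration.
    return rescale_shorter_side(image, random.choice(TRAIN_SCALES))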

We then test on VOC 2012. We train the network on the union of the VOC 2007 trainval, VOC 2007 test, and VOC 2012 trainval sets. The parameters are the same as above, and in the PSRoI pooling every RoI is divided into 7 \(\times \) 7 bins (K = 7). We compare with the result of RFCN [5], and our network shows some improvement. Table 2 shows the result.

Table 1. Results on PASCAL VOC 2007. The training data is the union of the VOC 2007 trainval and VOC 2012 trainval sets and the test data is the VOC 2007 test set. The backbone of the network is ResNet101.
Table 2. Results on PASCAL VOC 2012. The training data is the union of the VOC 2007 trainval, VOC 2007 test, and VOC 2012 trainval sets and the test data is the VOC 2012 test set. The backbone of the network is ResNet101. The model is trained with the multi-scale strategy.

4.2 Experiments on MS COCO

MS COCO has 80 object categories. We train our network on the trainval set and test on the test-dev set. The learning rate is set to 0.001 for the first 110k iterations and 0.0001 for the rest. We use ResNet101 to compute the feature maps, and in the PSRoI pooling every RoI is divided into 7 \(\times \) 7 bins (K = 7). The multi-scale training strategy [11] is adopted here. The results are shown in Table 3. We find that our network shows some improvement compared with RFCN [5].

Table 3. Results on MS COCO. The training data is the COCO trainval set and the test data is the COCO test-dev set. The backbone of the network is ResNet101.