
1 Introduction

Object detection is a computer vision task that identifies and localizes objects in images or videos. There may be one or more objects in an image, and localizing an object that can appear anywhere in the image is difficult. A series of methods has been proposed for this task. The DNN-based methods can generally be divided into two groups: (1) one-stage methods and (2) two-stage methods.

One-stage methods, including YOLO [9] and SSD [10], are more efficient than two-stage methods and are applied in real-time object detection. One-stage methods generate the classification and box information directly by regression, without a region proposal network. Two-stage methods are based on the RCNN [4] architecture and have been improved into many other methods, such as Fast RCNN, Faster RCNN, RFCN, and Mask RCNN [1, 2, 3, 5]. In two-stage methods, deep convolutional neural networks pretrained on ImageNet are used to extract feature maps and are then fine-tuned through backpropagation. Two-stage methods can be divided into two subnetworks: a region proposal network and a prediction network. In general, two-stage methods perform better than one-stage methods in object detection. Accuracy and speed are a pair of contradictions, and how to better balance them has been an important direction in object detection research.

In current research, some significant methods have been proposed and perform well, such as [3, 8]. FPN [8] develops a top-down architecture with lateral connections to build high-level semantic feature maps at all scales. The main idea is to build a feature pyramid by exploiting the multi-scale, pyramidal hierarchy inherent in deep convolutional networks. As a general feature extractor, FPN addresses the multi-scale problem explicitly, with strong expressive ability and robustness across scales. Mask RCNN [3] proposes an additional branch for the object mask task and has become a baseline for object detection.

The region proposal network (RPN) is a significant subnetwork that outputs many candidate boxes for objects. The boxes and features are input into RoI pooling, where the features inside each box are extracted and reshaped to a preset scale by a special pooling layer with a variable filter shape. Many improvements have been proposed, such as Position-Sensitive RoI pooling (PSRoI pooling) [5] and RoIAlign [3]. PSRoI pooling, proposed in RFCN, encodes position information with respect to relative spatial positions. RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin. In this paper, a modified PSRoI pooling is proposed: each bin in PSRoI pooling is scaled to 2\(\times \) so that it includes more spatial information around the area. We test our proposed network with a ResNet101 backbone and obtain improvements on several datasets.
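As a brief illustration of the bilinear interpolation at the heart of RoIAlign, the following minimal NumPy sketch samples a single-channel feature map at a non-integer location (the function name and the simplified layout are our own, not taken from the RoIAlign implementation):

import numpy as np

def bilinear_sample(feat, x, y):
    # feat: 2-D feature map of shape (H, W); (x, y): continuous sampling location.
    H, W = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    dx, dy = x - x0, y - y0
    # weighted average of the four surrounding integer grid points
    return ((1 - dx) * (1 - dy) * feat[y0, x0] + dx * (1 - dy) * feat[y0, x1]
            + (1 - dx) * dy * feat[y1, x0] + dx * dy * feat[y1, x1])

RoIAlign evaluates this at four regularly sampled locations in each bin and aggregates the results, avoiding the coordinate quantization of standard RoI pooling.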

2 Related Work

Among object detection algorithms based on region proposals, Fast R-CNN [2] adds an RoI pooling layer after the last convolutional layer of R-CNN [4], adds box regression to the CNN training process, uses Softmax instead of SVM for classification, and is trained end-to-end. Faster R-CNN [1] uses an RPN instead of Fast R-CNN's Selective Search method [2], allowing the RPN and the Fast R-CNN network to share the feature extraction network. Mask R-CNN [3] adds an FCN branch to generate the corresponding mask on top of the original Faster R-CNN algorithm. The algorithm can be used to accomplish various tasks such as object classification, object detection, semantic segmentation, instance segmentation, and human pose recognition.

We know that classification requires features with translation invariance, while object detection requires accurate responses to the translation of objects. Most CNNs do a good job in classification but are less effective in detection. For this problem, methods such as Faster R-CNN [1] are convolutional, and thus translation-invariant, before RoI pooling, but once RoI pooling is inserted, the network structure that follows is no longer translation-invariant. The position-sensitive score maps proposed in RFCN [5] integrate object position information into RoI pooling, which solves this problem well. CoupleNet [6] introduces the global and context information of the proposal on top of the original RFCN [5] and improves detection accuracy by combining part, global, and context information. Because feature maps at different layers have different advantages, a multi-scale convolutional neural network has been proposed in which detectors of different scales are designed for different layers of the feature map [7]. Based on the RFCN network, this paper builds a modified position-sensitive pooling that incorporates spatial information and achieves improvements on several datasets.

3 Our Approach

3.1 PSRoI Pooling

Since our work is based on PSRoI pooling [5], we first introduce it. In RFCN, images are input into the backbone subnetwork and feature maps are extracted. The feature maps then generate score maps and location information. In PSRoI pooling, the score maps and candidate boxes are the inputs. The score maps have K \(\times \) K \(\times \) (C+1) channels. We extract the sub-area of the score maps according to the candidate boxes. The RoI is then divided into K \(\times \) K bins. Following RFCN, the operation of position-sensitive RoI pooling [5] in the (i, j)-th bin \((0\le {i, j}\le {k-1})\) is:

$$\begin{aligned} r_c(i,j|\varTheta )=\sum _{(x,y)\in {bin(i,j)}}{z_{i,j,c}(x+x_0,y+y_0|\varTheta )/n} \end{aligned}$$
(1)

Here \(r_c(i,j)\) is the pooled response in the (i, j)-th bin for the c-th category, \(z_{i,j,c}\) is one of the \(k^2(C+1)\) score maps, \((x_0, y_0)\) denotes the top-left corner of the RoI, n is the number of pixels in the bin, and \(\varTheta \) denotes all learnable parameters of the network. When we compute the score of one bin, such as the top-left bin, we first find the corresponding C + 1 score maps. We extract the sub-area of the bin and compute the average score as the output. We then vote over the bins to get the final score of the whole RoI for every class.
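To make Eq. (1) concrete, here is a minimal NumPy sketch of the pooling and voting steps. The channel ordering of the score maps is an assumption for illustration; real implementations such as RFCN's CUDA kernel differ in detail:

import numpy as np

def psroi_pool(score_maps, roi, K, C):
    # score_maps: (K*K*(C+1), H, W), the k^2(C+1) position-sensitive score maps.
    # roi: (x0, y0, x1, y1) in pixel coordinates on the score maps.
    x0, y0, x1, y1 = roi
    bin_w = (x1 - x0) / K
    bin_h = (y1 - y0) / K
    pooled = np.zeros((K, K, C + 1))
    for i in range(K):
        for j in range(K):
            # pixel extent of bin (i, j)
            xs = int(np.floor(x0 + j * bin_w))
            xe = max(xs + 1, int(np.ceil(x0 + (j + 1) * bin_w)))
            ys = int(np.floor(y0 + i * bin_h))
            ye = max(ys + 1, int(np.ceil(y0 + (i + 1) * bin_h)))
            for c in range(C + 1):
                ch = (i * K + j) * (C + 1) + c  # assumed channel layout
                # average over the n pixels of the bin, as in Eq. (1)
                pooled[i, j, c] = score_maps[ch, ys:ye, xs:xe].mean()
    # vote: average the K*K bin scores to get one score per class for the RoI
    return pooled.reshape(K * K, C + 1).mean(axis=0)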

3.2 Our Modified Model

Referring to [6, 7], using a larger RoI that includes spatial information can help to improve the performance of the network. In PSRoI pooling, the score of each bin is computed only over the pixels inside the bin. When computing the score in each bin, we rescale the size of the bin to 2\(\times \). Figure 1 is an example of our model. For the central bin, the original PSRoI pooling computes the score for the C + 1 classes over the corresponding bin of the RoI, while our modified PSRoI pooling computes the score for the C + 1 classes over the 2\(\times \) bin. For example, let \((x_0,y_0,x_1,y_1)\) be the coordinates of one bin in the RoI, so the width of the bin is \(x_1-x_0\) and the height of the bin is \(y_1-y_0\). In our modified network, the coordinates are set to \((x_0-(x_1-x_0)/2,\,y_0-(y_1-y_0)/2,\,x_1+(x_1-x_0)/2,\,y_1+(y_1-y_0)/2)\). With this modification, some of the area may fall outside the image; we set the score outside the image to 0. We thereby add spatial information around the RoI, which improves the performance of the network. We manually set the weight of the area outside the original bin to 0.5. The score of the (i, j)-th bin \((0\le {i,j}\le {k-1})\) is then modified to:

$$\begin{aligned} r_c(i,j|\varTheta )=\sum _{(x,y)\in {bin^{\star }(i,j)}}{w\times z_{i,j,c}(x+x_0,y+y_0|\varTheta )/2n} \end{aligned}$$
(2)

where \(bin^{\star }(i,j)\) is the 2\(\times \) enlarged bin(i, j), and w is 1 where \((x,y)\in bin(i,j)\) and 0.5 where \((x,y)\not \in bin(i,j)\).
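The following is a minimal NumPy sketch of the modified pooling in Eq. (2), under the same assumed channel layout as the sketch in Sect. 3.1; the per-pixel loop is written for clarity, not speed:

import numpy as np

def modified_psroi_pool(score_maps, roi, K, C, w_out=0.5):
    # score_maps: (K*K*(C+1), H, W); roi: (x0, y0, x1, y1) on the score maps.
    _, H, W = score_maps.shape
    rx0, ry0, rx1, ry1 = roi
    bin_w = (rx1 - rx0) / K
    bin_h = (ry1 - ry0) / K
    pooled = np.zeros((K, K, C + 1))
    for i in range(K):
        for j in range(K):
            # original bin and its 2x enlargement (half a bin on each side)
            bx0, bx1 = rx0 + j * bin_w, rx0 + (j + 1) * bin_w
            by0, by1 = ry0 + i * bin_h, ry0 + (i + 1) * bin_h
            ex0, ex1 = bx0 - bin_w / 2, bx1 + bin_w / 2
            ey0, ey1 = by0 - bin_h / 2, by1 + bin_h / 2
            n = max(1, int(round(bin_w * bin_h)))  # pixels in the original bin
            for c in range(C + 1):
                ch = (i * K + j) * (C + 1) + c  # assumed channel layout
                acc = 0.0
                for y in range(int(np.floor(ey0)), int(np.ceil(ey1))):
                    for x in range(int(np.floor(ex0)), int(np.ceil(ex1))):
                        if not (0 <= x < W and 0 <= y < H):
                            continue  # score outside the image is set to 0
                        inside = (bx0 <= x < bx1) and (by0 <= y < by1)
                        w = 1.0 if inside else w_out  # w = 0.5 outside bin(i, j)
                        acc += w * score_maps[ch, y, x]
                pooled[i, j, c] = acc / (2 * n)  # divide by 2n as in Eq. (2)
    return pooled.reshape(K * K, C + 1).mean(axis=0)  # vote as before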

Fig. 1. The top figure shows the PSRoI pooling operation of the central bin. The bottom figure shows our modified model. The scale of the pooling region is enlarged to 2\(\times \).

4 Experiments

4.1 Experiments on PASCAL VOC

PASCAL VOC has 20 object categories. The VOC 2007 and VOC 2012 sets are widely used in object detection. We first train on the union of the VOC 2007 trainval and VOC 2012 trainval sets, and test on the VOC 2007 test set. We use a ResNet101 backbone to compute the feature maps, and in the PSRoI pooling every RoI is divided into 7 \(\times \) 7 bins (K = 7). The multi-scale training strategy [11] is adopted here: following RFCN [5], for each training iteration we randomly resize the image to 400, 500, 600, 700, or 800 pixels. When testing, we test only on a single scale of 600 pixels. The same strategy is adopted in the following experiments. The learning rate is set to 0.001 for the first 30k iterations and then to 0.0001 for the rest, with a mini-batch size of 8. We compare RFCN [5] and our network; the results are shown in Table 1.
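As a sketch of the multi-scale strategy, the snippet below draws one scale per iteration. We assume the scales refer to the shorter image side, as is conventional for RFCN [5]; cv2 is used here only for illustration:

import random
import cv2

TRAIN_SCALES = (400, 500, 600, 700, 800)  # shorter-side sizes for training
TEST_SCALE = 600                          # single scale at test time

def rescale_shorter_side(image, target):
    # Resize so the shorter side equals `target`, preserving aspect ratio.
    h, w = image.shape[:2]
    s = target / min(h, w)
    return cv2.resize(image, (int(round(w * s)), int(round(h * s))))

def training_rescale(image):
    # One scale is drawn at random for each training iteration.
    return rescale_shorter_side(image, random.choice(TRAIN_SCALES))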

We then test on VOC 2012. We train the network on the union of the VOC 2007 trainval, VOC 2007 test, and VOC 2012 trainval sets. The parameters are the same as above, and in the PSRoI pooling every RoI is divided into 7 \(\times \) 7 bins (K = 7). We compare with the result of RFCN [5], and our network shows some improvement. Table 2 shows the result.

Table 1. Results on PASCAL VOC 2007. The training data is the union of the VOC 2007 trainval and VOC 2012 trainval sets and the test data is the VOC 2007 test set. The backbone of the network is ResNet101.
Table 2. Results on PASCAL VOC 2012. The training data is the union of the VOC 2007 trainval, VOC 2007 test, and VOC 2012 trainval sets and the test data is the VOC 2012 test set. The backbone of the network is ResNet101. The model is trained with the multi-scale strategy.

4.2 Experiments on MS COCO

MS COCO has 80 object categories. We train our network on the trainval set and test on the test-dev set. The learning rate is set to 0.001 for the first 110k iterations and 0.0001 for the rest. We use ResNet101 to compute the feature maps, and in the PSRoI pooling every RoI is divided into 7 \(\times \) 7 bins (K = 7). The multi-scale training strategy [11] is adopted here. The results are shown in Table 3. We find that our network shows some improvement compared with RFCN [5].

Table 3. Results on MS COCO. The training data is the COCO trainval set and the test data is the COCO test-dev set. The backbone of the network is ResNet101.