Abstract
The task of object detection is to identify the bounding box of an object and its corresponding category in an image. In this paper, we propose a new one-stage anchor-free object detection algorithm, OSAF_e, which takes the effective mapping area into consideration. A feature extraction network is used to obtain high-level features, and the true bounding box of each object in the original image is mapped onto the grid of the feature map in order to perform category prediction and bounding box regression. The proposed algorithm is evaluated on the Pascal VOC dataset, and the experiments indicate that it achieves better results.
1 Introduction
Object detection is a classic problem in the field of computer vision. It aims to provide the location and category information of objects in given images. Deep learning theory has drawn the attention of many researchers, and the recently emerged anchor-free approach is a hot spot in this area.
Recently, most researchers have focused on anchor-free methods [1, 2, 10]. One notable method, CornerNet [1], performs bounding box regression by detecting a pair of corners of a bounding box and grouping them to form the final result, but the network only predicts a single set of position offsets when the corners of multiple objects fall in the same grid. FCOS [2] and FoveaBox [10] rely on the features in each grid to perform classification and bounding box regression, where the detection category of each grid is determined by the mapping of the true bounding boxes.
However, mapping the bounding box of an object in the original image to the feature map grid raises two problems: first, how to compute the mapping area in the feature map; second, how to determine the detection category of an overlapping (occluded) area.
Based on the idea that bounding boxes predicted by locations far away from the center of an object are of low quality [2], we only map the central 1/4 region of the true bounding box to the feature map as the effective mapping area. When multiple bounding boxes are mapped to the same grid, the detection category of this grid is determined by its distance to each of the overlapping bounding boxes.
The contributions of this paper are as follows:
First, we propose a method to compute the mapping area in the feature map that specifically focuses on the features near the center of an object.
Second, we propose a method to determine the detection category of a grid when multiple object bounding boxes are mapped to the same feature map grid.
Third, we conduct a series of experiments on the Pascal VOC dataset, and the results demonstrate the effectiveness of our method.
2 Related Work
At present, deep learning based object detection methods mainly include one-stage methods and two-stage methods. One-stage methods simplify object detection into a unified end-to-end regression problem; the YOLO series [5,6,7] and SSD [8] are representative examples. Two-stage methods divide detection into two stages: first, an RPN [15] is used to select candidate regions, which are then classified and refined to produce the final results, as in the R-CNN series [3, 4, 15].
Depending on whether anchor boxes are used during bounding box regression, we divide object detection methods into anchor-based methods and anchor-free methods. In anchor-based methods, the network learns a mapping from anchor boxes to the true bounding boxes [3, 4, 7]. However, anchor-based detectors need to preset many hyperparameters, such as the anchors' shapes, sizes, number, and IoU threshold. These parameters need to be carefully designed and depend heavily on the distribution of bounding boxes in the dataset. At the same time, they also cause an imbalance [12, 13] between positive and negative samples.
Anchor-free methods directly predict the bounding boxes, which simplifies the detection process [1, 2, 5, 15]. However, they cannot effectively solve the problem of occlusion. Traditionally, an FPN [17] is used to detect objects at separate layers in order to mitigate occlusion, but occlusion still appears within a single layer.
Motivated by the above observations, we propose a new one-stage anchor-free object detection method that takes the effective mapping area into consideration. We first map the bounding boxes of an image into the feature map to assign a detection category to each grid. Then, a distance-based judging metric is used to decide the detection category of overlapping grids in order to address occlusion.
3 Methodology
3.1 Overall Network Architecture
In this paper, we introduce OSAF_e, a general model that incorporates the one-stage object detection process with the consideration of the effective area. An overview of OSAF_e is shown in Fig. 1. More specifically, we first compute the effective mapping area in the feature maps, generating a high-level semantic area for each bounding box by mapping the true bounding boxes of an image to feature map grids. Then, the prediction task of each feature map grid is decided by its distance to the overlapping bounding boxes, which overcomes the problem of occlusion. Finally, the channel features in each grid are used to perform class prediction and bounding box regression.
3.2 Area Mapping
Our OSAF_e method first maps the true bounding boxes in an image to the feature maps to generate high-level semantic areas. Define the input image size as X and the total down-sampling step of the feature extraction network as S; the size of the feature map is then X/S. Defining the bounding box of an object in the image as \( B_{i} = \left\{ {l_{i} ,r_{i} ,t_{i} ,b_{i} ,c_{i} } \right\} \), the area mapped by the bounding box \( B_{i} \) onto the feature map is \( B_{fi} = \left\{ {l_{fi} ,r_{fi} ,t_{fi} ,b_{fi} ,c_{i} } \right\} \). The mapping functions are shown below.
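A plausible form of these functions, assuming each coordinate is simply rescaled by the down-sampling step S (the floor/ceil rounding here is our assumption, not a detail stated in the text), is \( l_{fi} = \left\lfloor {l_{i} /S} \right\rfloor \), \( t_{fi} = \left\lfloor {t_{i} /S} \right\rfloor \), \( r_{fi} = \left\lceil {r_{i} /S} \right\rceil \), \( b_{fi} = \left\lceil {b_{i} /S} \right\rceil \).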
Based on the idea that bounding boxes predicted by locations far away from the center of an object are of low quality [2], we draw on FoveaBox [10] and propose the idea of an effective mapping area. Specifically, we fix the center point of the bounding box in the original image and set the effective width and height of the object bounding box to half of the original width and height, then use the mapping operations in functions 1, 2, 3 and 4 to generate the effective feature map area of the object, \( B_{efi} = \left\{ {l_{efi} ,r_{efi} ,t_{efi} ,b_{efi} ,c_{i} } \right\} \).
During training, when a feature map grid \( \left( {x,y} \right) \) falls within the effective feature map area \( B_{efi} \), the grid is considered a positive sample and its detection category label is set to \( c_{i} \); otherwise, it is a negative sample and its detection category label is 0.
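As an illustration, the following sketch combines the mapping and the effective-area label assignment described above. It is a minimal re-implementation under our own assumptions (rounding convention, array layout), not the paper's exact code.

```python
import numpy as np

def assign_grid_labels(boxes, classes, feat_size, stride, shrink=0.5):
    """Hypothetical effective-area label assignment (sketch, not the paper's code).

    boxes:   (N, 4) array of true boxes (l, t, r, b) in image coordinates.
    classes: (N,) array of category ids (1..C); 0 is reserved for background.
    Returns a (feat_size, feat_size) integer label map.
    """
    labels = np.zeros((feat_size, feat_size), dtype=np.int32)
    for (l, t, r, b), c in zip(boxes, classes):
        # Keep the box centre fixed and shrink width/height to half,
        # which keeps only the central 1/4 region of the box.
        cx, cy = (l + r) / 2.0, (t + b) / 2.0
        w, h = (r - l) * shrink, (b - t) * shrink
        # Map the effective box onto the feature map (rounding is assumed).
        x0 = max(int(np.floor((cx - w / 2) / stride)), 0)
        y0 = max(int(np.floor((cy - h / 2) / stride)), 0)
        x1 = min(int(np.ceil((cx + w / 2) / stride)), feat_size)
        y1 = min(int(np.ceil((cy + h / 2) / stride)), feat_size)
        labels[y0:y1, x0:x1] = c  # grids inside the effective area become positives
    return labels
```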
3.3 Overlapping Area Judging
Our OSAF_e method also introduces an overlapping-area judging problem, which does not exist in anchor-based detectors. When multiple bounding boxes in an image overlap, the overlapping grid belongs to multiple categories, as illustrated in Fig. 2.
In this case, it is necessary to determine the detection category of an overlapping grid. We simply calculate the distance from the center point of the grid to each of the overlapping bounding boxes and select the category corresponding to the smallest distance as the detection category label of this grid.
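A minimal sketch of this tie-breaking rule is shown below. We assume the distance is measured to the center of each overlapping box, which is our reading of the rule rather than a detail stated explicitly in the text.

```python
import numpy as np

def resolve_overlap(grid_x, grid_y, boxes, classes, stride):
    """Pick a detection category for an overlapping grid (assumed tie-breaking rule)."""
    gx = (grid_x + 0.5) * stride  # grid centre mapped back to image coordinates
    gy = (grid_y + 0.5) * stride
    best_c, best_d = 0, float("inf")
    for (l, t, r, b), c in zip(boxes, classes):
        cx, cy = (l + r) / 2.0, (t + b) / 2.0  # box centre (our assumption for "distance")
        d = np.hypot(gx - cx, gy - cy)
        if d < best_d:
            best_d, best_c = d, c
    return best_c
```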
3.4 Bounding Box Regression
We use the channel features in grid \( \left( {x,y} \right) \) to perform bounding box regression [2]. The output is a 4D vector \( \left\{ {l_{p} ,t_{p} ,r_{p} ,b_{p} } \right\} \) of the distances from the current grid center to the four sides of the normalized bounding box. Let the center point of the current grid be \( \left( {x^{i} ,y^{i} } \right) \); the 4D regression vector of the bounding box can then be formulated as follows.
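A plausible form, following the FCOS-style targets [2] and assuming the distances are normalized by the input size X (the normalization is our assumption), is \( l_{p} = \left( {x^{i} - l_{i} } \right)/X \), \( t_{p} = \left( {y^{i} - t_{i} } \right)/X \), \( r_{p} = \left( {r_{i} - x^{i} } \right)/X \), \( b_{p} = \left( {b_{i} - y^{i} } \right)/X \).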
This differs from anchor-based detectors, which only consider anchor boxes with a sufficiently high IoU with the ground-truth boxes as positive samples.
3.5 Loss Function
We define the training loss as follows:
Following SSD [8], the logistic activation function is used for the classification branch. The loss includes a classification loss and a bounding box regression loss, and \( s^{2} \) represents the total number of feature map grids.
\( I_{i}^{ige} \) indicates that the grid \( \left( {x,y} \right) \) falls within \( B_{fi} \) but outside \( B_{efi} \), and that the IoU of the predicted bounding box and the true bounding box in this grid is greater than 0.5.
\( I_{i}^{obj} \) indicates that the bounding box regression loss is calculated only when the grid is treated as a positive sample. A grid is treated as a positive sample in the following two cases,
- Case 1: The grid \( \left( {x,y} \right) \) falls within the effective feature map area \( B_{efi} \).
- Case 2: The grid \( \left( {x,y} \right) \) falls within the \( I_{i}^{ige} \) area.
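A minimal sketch of how such a loss could be assembled is given below, assuming sigmoid cross-entropy for the classification term and a smooth-L1 penalty for the regression term; the concrete regression loss and any weighting factors are our assumptions.

```python
import tensorflow as tf

def osaf_loss(cls_logits, box_pred, cls_labels, box_targets, pos_mask):
    """Sketch of the training loss; the smooth-L1 regression term is an assumption.

    cls_logits:  (s*s, C) float  per-grid class logits
    box_pred:    (s*s, 4) float  predicted (l, t, r, b) offsets
    cls_labels:  (s*s, C) float  one-hot labels (all-zero rows for background)
    box_targets: (s*s, 4) float  regression targets
    pos_mask:    (s*s,)   float  1.0 for positive grids (cases 1 and 2), else 0.0
    """
    # Classification: logistic (sigmoid) loss over every grid.
    cls_loss = tf.reduce_sum(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=cls_labels, logits=cls_logits))
    # Regression: smooth-L1, counted only on positive grids.
    diff = tf.abs(box_pred - box_targets)
    smooth_l1 = tf.where(diff < 1.0, 0.5 * diff * diff, diff - 0.5)
    reg_loss = tf.reduce_sum(pos_mask[:, None] * smooth_l1)
    return cls_loss + reg_loss
```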
4 Experiment
Experiment Data.
We perform the experiments on the Pascal VOC dataset. The samples from 2008 to 2012 are mixed for joint training. The samples of the original training and validation sets are shuffled and mixed, and new training and validation sets are generated in a 4:1 ratio, giving 9633 training samples and 1905 validation samples. All experimental results and analysis are reported on the 2012 test set.
Network Architecture.
We treat the state-of-the-art YOLOv3 [7] as our baseline and implement OSAF_e on top of it in TensorFlow. Darknet-53 pretrained on ImageNet [14] is used as the backbone network for feature extraction. We fix the input image size to 416 × 416 during training and testing. The OSAF_e network simply replaces YOLOv3's three anchor-based detection layers with anchor-free layers.
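As a rough illustration of what replacing an anchor-based YOLOv3 layer with an anchor-free layer could look like, the sketch below predicts per-grid class logits and four box offsets from a backbone feature map; the layer widths and kernel sizes are our assumptions, not the paper's exact configuration.

```python
import tensorflow as tf

def anchor_free_head(feature_map, num_classes):
    """Hypothetical anchor-free head replacing an anchor-based YOLOv3 layer.

    feature_map: (batch, s, s, channels) backbone output, e.g. the 26 x 26 (C2) map.
    Returns per-grid class logits and per-grid (l, t, r, b) box offsets.
    """
    x = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(feature_map)
    cls_logits = tf.keras.layers.Conv2D(num_classes, 1)(x)  # one score per class per grid
    box_offsets = tf.keras.layers.Conv2D(4, 1)(x)           # four distances per grid
    return cls_logits, box_offsets
```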
Parameter Settings.
A wide range of data augmentation techniques is used to prevent overfitting during training. We apply the momentum optimization algorithm to optimize the model. A piecewise schedule is used to adjust the learning rate: 1e−4 for the first 25 epochs, 3e−5 for the middle 40 epochs, and 1e−4 for the last 35 epochs.
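For reference, the piecewise schedule described above could be written in TensorFlow as follows; `steps_per_epoch` and the momentum value are assumed placeholders.

```python
import tensorflow as tf

steps_per_epoch = 1000  # assumed placeholder; depends on batch size and dataset size
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[25 * steps_per_epoch, 65 * steps_per_epoch],  # 25 + 40 = 65 epochs
    values=[1e-4, 3e-5, 1e-4])                                # rates from the text
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)  # momentum value assumed
```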
4.1 Comparison with Different Size of Feature Maps
We report the result comparisons for OSAF_e with different sizes of feature maps in Table 1. We adopt Average Precision (AP) across IoU thresholds from 0.5 to 0.7 with an interval of 0.1 to evaluate performance. C1, C2, and C3 respectively indicate that object detection is performed on feature maps of size 13 × 13, 26 × 26, and 52 × 52.
As can be seen, our OSAF_e achieves an AP50 of 61.5% when the effective area is considered in the C2 layer for training and evaluation, compared to 60.8% without it. Significant performance gaps can also be observed for AP60 and AP70. This supports the claim that grid channel features far away from the center of an object in the feature map are of low quality.
We can also see that our OSAF_e achieves the highest AP scores in the C2 layer. The likely reasons are that the C1 layer loses most features of small objects during feature extraction, while in the C3 layer large objects have larger mapping areas on the feature map and the simple NMS method cannot effectively remove the wrong bounding box predictions.
Figure 3 shows some qualitative results using the C2 layer as the detection layer while also considering the effective mapping area. The results show that our OSAF_e can effectively handle occlusion and can detect objects in some complex scenes.
4.2 Comparison with Different IoU Thresholds for NMS
To better understand why the simple NMS method cannot effectively remove wrong bounding box predictions, we compare different IoU thresholds for the NMS step in Table 2. Note that larger objects produce more bounding box predictions, since OSAF_e performs bounding box prediction in every feature map grid. We choose the C3 layer for this experiment because it can generate at most 2704 bounding box predictions, compared with 676 in the C2 layer and 169 in the C1 layer.
As can be seen, our OSAF_e achieves a poor AP50 score when using the traditional IoU threshold of 0.5 for NMS. This indicates that the bounding box regression results for larger objects are more scattered, so we need to set a lower IoU threshold to discard as many wrong predictions as possible. We therefore evaluate a series of IoU thresholds for comparison, and the results show that a threshold of 0.4 gives the best performance.
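For clarity, a minimal greedy NMS with a configurable IoU threshold is sketched below; lowering the threshold to 0.4 simply makes the suppression stricter.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.4):
    """Greedy NMS; boxes is an (N, 4) array of (l, t, r, b), scores is (N,)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the highest-scoring box with the remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]  # keep only boxes that overlap less than the threshold
    return keep
```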
4.3 Comparison with State-of-the-Art
We report the result comparisons on the VOC 2012 test set in Table 3. We compare with the two-stage detectors R-CNN [15], Fast R-CNN [3], and Faster R-CNN [4], and the one-stage detector YOLOv1 [5]. Our OSAF_e detector uses a 0.5 threshold for NMS, the C2 layer for detection, and also considers the effective area.
As can be seen, the mAP of OSAF_e is 5.5% higher than that of YOLOv1 [5]. In particular, the AP of the bike, bottle, chair, plant, and train categories improves by more than 10% over YOLOv1 [5]. Moreover, OSAF_e far exceeds R-CNN [15], but it is weaker than two-stage detectors such as Fast R-CNN [3] and Faster R-CNN [4]. The reason may be that the simple NMS method used to generate the final bounding box predictions does not work well.
5 Conclusion
We have proposed a new one-stage anchor-free object detector, OSAF_e, which solves object detection in a per-pixel prediction fashion. As shown in the experiments, OSAF_e outperforms YOLOv1 [5] and R-CNN [15]. It also avoids the computation and hyperparameters related to anchor boxes. In future work, this algorithm will be migrated to feature pyramid networks [17].
References
Law, H., Deng, J.: CornerNet: detecting objects as paired keypoints. In: Proceedings of European Conference on Computer Vision, pp. 734–750 (2018)
Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355 (2019)
Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of Advances in Neural Information Processing Systems, pp. 91–99 (2015)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271 (2017)
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
Liu, W., et al.: SSD: single shot multibox detector. In: Proceedings of European Conference on Computer Vision, pp. 21–37 (2016)
Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2980–2988 (2017)
Kong, T., Sun, F., Liu, H., Jiang, Y., Shi, J.: FoveaBox: beyond anchor-based object detector. CoRR abs/1904.03797 (2019)
Lin, T.-Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, p. 4 (2017)
Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769 (2016)
Kong, T., Sun, F., Yao, A., Liu, H., Lu, M., Chen, Y.: RON: reverse connection with objectness prior networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, p. 2 (2017)
Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010)
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
Acknowledgement
This work was supported in part by the National Key R&D Program of China (No. 2018YFC0807500), and by the Ministry of Science and Technology of Sichuan Province Program (Nos. 2018GZDZX0048, 2018JY0067, 20ZDYF0343, 2018HH0075).