
1 Introduction

Object detection is a classic problem in computer vision: given an image, it aims to provide the location and category of each object. Deep learning has drawn the attention of many researchers, and the emerging anchor-free approach has become a hot spot in this area.

Recently, many researchers have focused on anchor-free methods [1, 2, 10]. One notable method, CornerNet [1], performs bounding box regression by detecting a pair of corners of a bounding box and grouping them to form the final result, but the network predicts only one set of position offsets when the corners of multiple objects fall in the same grid. FCOS [2] and FoveaBox [10] rely on the features in each grid to perform classification and bounding box regression, where the detection category of each grid is determined by the mapping of ground-truth bounding boxes.

However, mapping the bounding box of an object from the original image to the feature map grid raises two problems: first, how to compute the mapping area in the feature map; second, how to determine the detection category of an overlapping (occluded) area.

Based on the idea that bounding boxes predicted from locations far away from the center of an object are of low quality [2], we map only the central 1/4 region of the ground-truth bounding box to the feature map as the effective mapping area. When multiple bounding boxes are mapped to the same grid, the detection category of this grid is determined by the distance to the right edge of each overlapping bounding box.

The contributions of this paper are as follows:

First, we propose a method to compute the mapping area in the feature map that focuses attention on features near the center of an object.

Second, we propose a method to determine the grid detection category when multiple object bounding boxes are mapped to the same feature map grid.

Third, we conduct a series of experiments on the Pascal VOC dataset, and the results demonstrate the effectiveness of our method.

2 Related Work

At present, deep learning based object detection methods mainly include one-stage and two-stage methods. One-stage methods simplify object detection into a unified end-to-end regression problem; the YOLO series [5, 6, 7] and SSD [8] are representative. Two-stage methods divide detection into two stages: first, an RPN [15] is used to select candidate regions; then the candidate regions are classified and localized to output the final results, as in the R-CNN series [3, 4, 15].

According to whether anchor boxes are used during bounding box regression, object detection methods can be divided into anchor-based and anchor-free methods. In anchor-based methods, the network learns a mapping from anchor boxes to the ground-truth bounding boxes [3, 4, 7]. However, anchor-based detectors need to preset many hyperparameters, such as the anchors' shapes, sizes, number, and IoU thresholds. These parameters must be carefully designed and are highly dependent on the distribution of bounding boxes in the dataset. At the same time, they also cause an imbalance [12, 13] between positive and negative samples.

Anchor-free methods predict bounding boxes directly, which simplifies the detection process [1, 2, 5, 15]. However, they cannot effectively solve the problem of occlusion. Traditionally, an FPN [17] is used to detect objects at separate layers in order to alleviate occlusion; however, occlusion still occurs within a single layer.

Motivated by the above observations, we propose a new one-stage anchor-free object detection method that takes the effective mapping area into consideration. We first map the bounding boxes of an image onto the feature map to assign a detection category to each grid. Then a right-edge distance metric is used to assign the detection category of overlapping grids, in order to solve the occlusion problem.

3 Methodology

3.1 Overall Network Architecture

In this paper, we introduce OSAF_e, a general model incorporating the one-stage object detection process with the consideration of the effective area. An overview of OSAF_e is shown in Fig. 1. More specifically, we first compute the effective mapping area in the feature maps, generating a high-level semantic area for each bounding box by mapping the ground-truth bounding boxes of an image to feature map grids. We then assign the prediction task of each feature map grid, overcoming the occlusion problem by measuring the distance to the right edge of each overlapping bounding box. Finally, the channel features in each grid are used to perform class prediction and bounding box regression.

Fig. 1. Overall network architecture

3.2 Area Mapping

Our OSAF_e method first maps the ground-truth bounding boxes of an image to the feature maps to generate high-level semantic areas. Define the input image size as X and the total downsampling stride of the feature extraction network as S; the size of the feature map is then X/S. Defining the bounding box of an object in the image as \( B_{i} = \left\{ {l_{i} ,r_{i} ,t_{i} ,b_{i} ,c_{i} } \right\} \), the area mapped by the bounding box \( B_{i} \) onto the feature map is \( B_{fi} = \left\{ {l_{fi} ,r_{fi} ,t_{fi} ,b_{fi} ,c_{i} } \right\} \). The mapping functions are shown below,

$$ l_{fi} = \left\{ {\begin{array}{*{20}l} {\left\lfloor l_{i} /S \right\rfloor ,} \hfill & {if\, l_{i} /S - \left\lfloor l_{i} /S \right\rfloor \le 0.5} \hfill \\ {\left\lfloor l_{i} /S \right\rfloor + 1,} \hfill & {else} \hfill \\ \end{array} } \right. $$
(1)
$$ r_{fi} = \left\{ {\begin{array}{*{20}l} {\left\lfloor r_{i} /S \right\rfloor ,} \hfill & {if\, r_{i} /S - \left\lfloor r_{i} /S \right\rfloor \le 0.5} \hfill \\ {\left\lfloor r_{i} /S \right\rfloor - 1,} \hfill & {else} \hfill \\ \end{array} } \right. $$
(2)
$$ t_{fi} = \left\{ {\begin{array}{*{20}l} {\left\lfloor t_{i} /S \right\rfloor ,} \hfill & {if\, t_{i} /S - \left\lfloor t_{i} /S \right\rfloor \le 0.5} \hfill \\ {\left\lfloor t_{i} /S \right\rfloor + 1,} \hfill & {else} \hfill \\ \end{array} } \right. $$
(3)
$$ b_{fi} = \left\{ {\begin{array}{*{20}l} {\left\lfloor b_{i} /S \right\rfloor ,} \hfill & {if\, b_{i} /S - \left\lfloor b_{i} /S \right\rfloor \le 0.5} \hfill \\ {\left\lfloor b_{i} /S \right\rfloor - 1,} \hfill & {else} \hfill \\ \end{array} } \right. $$
(4)

Based on the idea that bounding boxes predicted from locations far away from the center of an object are of low quality [2], we draw on FoveaBox [10] and propose the idea of an effective mapping area. Specifically, we fix the center point of the bounding box in the original image and set the effective width and height of the object bounding box to half of the original width and height; then we apply the mapping operations of Eqs. (1)-(4) to generate the effective feature map area of the object, \( B_{efi} = \left\{ {l_{efi} ,r_{efi} ,t_{efi} ,b_{efi} ,c_{i} } \right\} \). A sketch of this procedure is given below.
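For concreteness, the following is a minimal Python sketch of the mapping in Eqs. (1)-(4) together with the half-size effective box. The function names and the \( (l, r, t, b, c) \) tuple layout are illustrative only; they are not released code from this paper.

```python
import math

def _round_out(v, stride):
    """Round a left/top coordinate to a grid index (Eq. 1 / Eq. 3)."""
    q = v / stride
    return int(math.floor(q)) if q - math.floor(q) <= 0.5 else int(math.floor(q)) + 1

def _round_in(v, stride):
    """Round a right/bottom coordinate to a grid index (Eq. 2 / Eq. 4)."""
    q = v / stride
    return int(math.floor(q)) if q - math.floor(q) <= 0.5 else int(math.floor(q)) - 1

def map_box_to_grid(box, stride):
    """Map an image-space box (l, r, t, b, c) to feature-map grid indices."""
    l, r, t, b, c = box
    return (_round_out(l, stride), _round_in(r, stride),
            _round_out(t, stride), _round_in(b, stride), c)

def effective_box(box):
    """Shrink a box to its central region: same center, half width and height."""
    l, r, t, b, c = box
    cx, cy = (l + r) / 2.0, (t + b) / 2.0
    w, h = (r - l) / 2.0, (b - t) / 2.0
    return (cx - w / 2.0, cx + w / 2.0, cy - h / 2.0, cy + h / 2.0, c)

# Example: a 100 x 60 box with stride S = 8.
box = (40.0, 140.0, 30.0, 90.0, 7)
print(map_box_to_grid(effective_box(box), 8))  # -> (8, 14, 6, 9, 7)
```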

During training, when a feature map grid \( \left( {x,y} \right) \) falls within the effective feature map area \( B_{efi} \), the grid is considered a positive sample and the detection category label of this grid is set to \( c_{i} \); otherwise, it is a negative sample and its detection category label is 0.

3.3 Overlapping Area Judging

Our OSAF_e method also introduces an overlapping-area judging problem, which does not exist in anchor-based detectors. When multiple bounding boxes in an image overlap, the overlapping grid belongs to multiple categories, as shown in Fig. 2.

Therefore, it is necessary to determine the detection category of an overlapping grid. We simply calculate the distance from the center point of the grid to the right edge of each overlapping bounding box, and select the category corresponding to the smallest distance as the detection category label of this grid. A sketch follows.
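The sketch below implements right-edge distance judging under our reading of Fig. 2: the winning category is the one whose box's right edge lies closest to the grid center. The helper name and the assumption that `grid_center` is given in image coordinates are ours.

```python
def judge_overlap(grid_center, boxes):
    """Pick the category whose right edge is nearest to the grid center.

    boxes: list of (l, r, t, b, c) ground-truth boxes covering this grid.
    """
    gx, gy = grid_center
    best_c, best_d = None, float("inf")
    for l, r, t, b, c in boxes:
        d = abs(r - gx)  # distance from grid center to the box's right edge
        if d < best_d:
            best_d, best_c = d, c
    return best_c

# Example: two overlapping boxes cover the same grid; the second box's
# right edge (105) is only 5 px from the grid center, so class 8 wins.
print(judge_overlap((100.0, 80.0),
                    [(60.0, 130.0, 50.0, 110.0, 3),
                     (90.0, 105.0, 40.0, 120.0, 8)]))  # -> 8
```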

Fig. 2. Right distance judging

3.4 Bounding Box Regression

We use the channel features in grid \( \left( {x,y} \right) \) to perform bounding box regression [2]. The output is a 4D vector \( \left\{ {l_{p} ,t_{p} ,r_{p} ,b_{p} } \right\} \) of the distances from the current grid center to the four sides of the normalized bounding box. Letting the center point of the current grid be \( \left( {x^{i} ,y^{i} } \right) \), the 4D regression target of the bounding box can be formulated as,

$$ l_{p} = x^{i} - l_{i} ,r_{p} = r_{i} - x^{i} $$
(5)
$$ t_{p} = y^{i} - t_{i} ,b_{p} = b_{i} - y^{i} $$
(6)
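
As a minimal sketch of Eqs. (5)-(6), the targets are plain differences between the grid center and the box sides; the variable names below are illustrative, and the grid center and box are assumed to share the same (normalized) coordinate frame.

```python
def regression_target(grid_center, box):
    """Return (l_p, t_p, r_p, b_p): distances to the four box sides."""
    x_i, y_i = grid_center
    l, r, t, b, _ = box
    return (x_i - l,  # l_p: distance to the left side  (Eq. 5)
            y_i - t,  # t_p: distance to the top side   (Eq. 6)
            r - x_i,  # r_p: distance to the right side (Eq. 5)
            b - y_i)  # b_p: distance to the bottom side (Eq. 6)

print(regression_target((0.5, 0.5), (0.2, 0.9, 0.3, 0.8, 1)))
# -> (0.3, 0.2, 0.4, 0.3)
```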

This differs from anchor-based detectors, which consider only the anchor boxes with a sufficiently high IoU with the ground-truth boxes as positive samples.

3.5 Loss Function

We define the training loss as follows:

$$ L_{total} = L_{cls} + L_{box} $$
(7)
$$ L_{cls} = \sum\nolimits_{i = 0}^{{s^{2} }} {\sum\nolimits_{j = 0}^{c} {\left( {1 - I_{i}^{ige} } \right)\left( {p_{cij} - t_{cij} } \right)^{2} } } $$
(8)
$$ L_{box} = \sum\nolimits_{i = 0}^{{s^{2} }} {I_{i}^{obj} \left[ {\left( {l_{pi} - l_{ti} } \right)^{2} + \left( {r_{pi} - r_{ti} } \right)^{2} + \left( {t_{pi} - t_{ti} } \right)^{2} + \left( {b_{pi} - b_{ti} } \right)^{2} } \right]} $$
(9)

Following SSD [8], the logistic activation function is used for the classification process. The loss includes a classification loss and a bounding box regression loss, where \( s^{2} \) represents the total number of feature map grids.

\( I_{i}^{ige} \) indicates that the grid \( \left( {x,y} \right) \) falls within \( B_{fi} \) but outside \( B_{efi} \), and that the IoU of the predicted bounding box and the ground-truth bounding box at this grid is greater than 0.5.

\( I_{i}^{obj} \) indicates that the bounding box regression loss is calculated only when the grid is assigned as a positive sample. A grid is assigned as a positive sample in the following two cases (a sketch of the full loss follows the list),

  • Case 1: The grid \( \left( {x,y} \right) \) falls within the effective feature map area \( B_{efi} \),

  • Case 2: The grid \( \left( {x,y} \right) \) falls in the \( I_{i}^{ige} \) area.
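
The following is a minimal NumPy sketch of Eqs. (7)-(9) for a single image. The shapes and names are our own illustration: `p_cls`/`t_cls` are \( (s^{2}, C) \) predicted/target class scores, `p_box`/`t_box` are \( (s^{2}, 4) \) regression outputs/targets, and `ige`/`obj` are the 0/1 indicator masks described above.

```python
import numpy as np

def total_loss(p_cls, t_cls, p_box, t_box, ige, obj):
    # Eq. (8): squared-error classification loss, excluding I^ige grids.
    l_cls = np.sum((1.0 - ige)[:, None] * (p_cls - t_cls) ** 2)
    # Eq. (9): squared-error box loss, computed only on positive (I^obj) grids.
    l_box = np.sum(obj[:, None] * (p_box - t_box) ** 2)
    # Eq. (7): total loss is the sum of the two terms.
    return l_cls + l_box
```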

4 Experiment

Experiment Data.

We perform the experiments on the Pascal VOC dataset. The samples from 2008 to 2012 are mixed for joint training. The training and validation samples are shuffled together and re-split in a 4:1 ratio, yielding 9633 training samples and 1905 validation samples. All experimental results and analyses are reported on the 2012 test set.

Network Architecture.

We treat the state-of-the-art YOLOv3 [7] as our baseline and implement OSAF_e on top of it in TensorFlow. Darknet-53 pretrained on ImageNet [14] is used as the backbone network for feature extraction. We fix the input image size to 416 × 416 during training and testing. The OSAF_e network simply replaces YOLOv3's three anchor-based layers with anchor-free layers.

Parameter Settings.

A wide range of data augmentation techniques is used to prevent overfitting during training. We apply the momentum optimization algorithm to optimize the model, with a piecewise constant schedule for the learning rate: 1e−4 for the first 25 epochs, 3e−5 for the middle 40 epochs, and 1e−4 for the last 35 epochs. A sketch of this schedule is given below.
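A minimal TensorFlow sketch of this schedule follows. The `steps_per_epoch` value is a hypothetical placeholder, and the momentum value 0.9 is our assumption, as the text does not specify it.

```python
import tensorflow as tf

steps_per_epoch = 100  # placeholder: depends on batch size and dataset size
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[25 * steps_per_epoch, 65 * steps_per_epoch],  # 25, then 25+40 epochs
    values=[1e-4, 3e-5, 1e-4])  # the three learning-rate phases from the text
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```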

4.1 Comparison with Different Size of Feature Maps

We report the results of OSAF_e with different sizes of feature maps in Table 1. We adopt Average Precision (AP) at IoU thresholds from 0.5 to 0.7 with an interval of 0.1 as the evaluation metric. C1, C2, and C3 denote object detection performed on feature maps of size 13 × 13, 26 × 26, and 52 × 52, respectively.

As can be seen, OSAF_e achieves an AP50 of 61.5%, compared with 60.8%, when the effective area is considered on the C2 layer for training and evaluation. A significant performance gap can also be observed for AP60 and AP70. This verifies that the grid channel features far away from the center of an object in the feature map are of low quality.

We can also see that OSAF_e achieves its highest AP scores on the C2 layer. The likely reasons are that the C1 layer loses most features of small objects during feature extraction, while large objects have larger mapping areas on the feature map in the C3 layer, so the simple NMS method cannot effectively remove the wrong bounding box predictions.

Table 1. Comparison of Average Precision (AP) on different size of feature maps.

Figure 3 shows some qualitative results using the C2 layer as the detection layer with the effective mapping area considered. The results show that OSAF_e can effectively handle occlusion and can detect objects in some complex scenes.

Fig. 3. Qualitative examples showing OSAF_e detection results

4.2 Comparison with Different IoU Thresholds for NMS

To better understand why the simple NMS method cannot effectively remove wrong bounding box predictions, we compare different IoU thresholds for the NMS stage in Table 2. Note that larger objects yield more bounding box regression results, since OSAF_e performs bounding box prediction in every feature map grid. We choose the C3 layer for this experiment because it can generate at most 2704 bounding box predictions, versus 676 in the C2 layer and 169 in the C1 layer.

As can be seen, OSAF_e achieves a poor AP50 score when using the traditional IoU threshold of 0.5 for NMS. This indicates that the bounding box regression results for larger objects are more disordered, so a lower IoU threshold is needed to discard as many wrong bounding box predictions as possible. We therefore compare a series of IoU thresholds; the results show that a threshold of 0.4 gives the best performance. A sketch of the NMS step follows.
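For reference, the following is a minimal sketch of the greedy NMS used here, with a configurable IoU threshold; the \( (l, t, r, b) \) box layout and helper names are illustrative.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (l, t, r, b)."""
    il, it = max(a[0], b[0]), max(a[1], b[1])
    ir, ib = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ir - il) * max(0.0, ib - it)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.4):
    """Keep highest-scoring boxes, suppressing overlaps above iou_thresh."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i)
        order = np.array([j for j in order[1:]
                          if iou(boxes[i], boxes[j]) < iou_thresh])
    return keep
```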

Table 2. Comparison with different IoU thresholds for NMS

4.3 Comparison with State-of-the-Art

We report comparison results on the VOC 2012 test set in Table 3. We compare with the two-stage detectors R-CNN [15], Fast R-CNN [3], and Faster R-CNN [4], and the one-stage detector YOLOv1 [5]. Our OSAF_e detector uses an NMS threshold of 0.5 and the C2 layer for detection, with the effective area considered.

Table 3. Part of the leaderboard on the VOC 2012 test set.

As can be seen, the mAP score of OSAF_e is 5.5% higher than that of YOLOv1 [5]. In particular, the AP of the bike, bottle, chair, plant, and train categories all improve by more than 10% over YOLOv1 [5]. Moreover, OSAF_e far exceeds R-CNN [15], but is weaker than the two-stage detectors Fast R-CNN [3] and Faster R-CNN [4]. The reason may be that using the simple NMS method to generate the final bounding box predictions does not work well.

5 Conclusion

We have proposed a new one-stage anchor-free object detector, OSAF_e, which performs object detection in a per-pixel prediction fashion. As shown in the experiments, the anchor-free OSAF_e outperforms YOLOv1 [5] and R-CNN [15]. It also avoids the computation and hyperparameters related to anchor boxes. In future work, this algorithm will be migrated to feature pyramid networks [17].