1 Introduction

In the era of deep learning, most object detection models with state-of-the-art performance are based on a two-stage scheme [1, 5, 7, 8, 25, 26], in which a sparse set of proposals is generated in the first stage, followed by region-wise object classification and coordinate regression in the second stage. Proposal generation has evolved from off-line methods, such as Selective Search [15] and objectness [16], to integrated learned ones [1, 18, 19], among which the Region Proposal Network (RPN) has become a standard component of state-of-the-art two-stage methods. During the training of RPN, candidate proposals are first sampled from pre-located dense anchors and then fed to the object/not-object classifier and the regressor. Within these dense anchors, the object and not-object classes are highly imbalanced: not-object samples far outnumber object samples, which makes it difficult to train a classifier with a regular policy. Thus, as a common strategy, only a fixed number of anchors with a fixed object/not-object ratio, e.g., 256 and 1:1 [1], are sampled for training. Although such a constraint on the sampling process balances the samples, it also reduces the diversity of proposals. Instead of constraining the sampling process, we investigate this imbalance problem from the perspective of designing the loss function used during training.

The class imbalance problem is also encountered in one-stage detection models [3, 11, 12, 13], for which different example sampling strategies [11, 14, 23] have been proposed. However, Lin et al. [3] claim that it is the vast number of easy samples that overwhelms the detector. They therefore propose to train on all pre-located dense anchors with a dynamically scaled cross entropy loss, called focal loss, which down-weights the losses of easy samples and thus prevents them from dominating the training process.

In this paper, we investigate the adaptation of focal loss to RPN (see Sect. 3.3), such that many more samples can be included for training without the training problems caused by class imbalance. By replacing the standard cross entropy loss in RPN with focal loss, RPN can be trained directly with no need for specially designed sampling strategies. Moreover, because RPN is implemented in a fully convolutional manner, no extra computation cost is required. We take Faster R-CNN [1, 2] as our baseline model and conduct experiments on the PASCAL VOC 2007 [24] and COCO [10] detection benchmarks. The experimental results demonstrate the effectiveness of the proposed method, implying that this sampling-free strategy can be directly applied to RPN, and hence to all state-of-the-art two-stage detectors built on it.

2 Related Work

Two-Stage Detectors. With the rapid development of deep learning [9] over the past few years, two-stage object detectors [1, 4, 5, 6, 7, 8, 25, 26] have become one of the dominant families of object detection methods. In two-stage methods, a sparse set of candidate proposals with high probabilities of containing objects is first generated [1, 15, 18, 19], followed by a second stage of object classification and coordinate regression. Empowered by deep neural networks [9, 20, 21, 22] and a series of improvements in both speed and accuracy [1, 4, 6, 7], the whole detection system has been integrated into a single network, i.e., the widely used Faster R-CNN [1] framework. Many works have extended this framework [5, 8, 25, 26]. We also use Faster R-CNN as our base model to investigate the adaptation of focal loss to RPN in this paper.

Region Proposal Methods. As the first stage in the two-stage scheme, region proposal methods have evolved from pioneering off-line methods, such as Selective Search [15] and objectness [16], to integrated learned ones [1, 18, 19], among which RPN integrates the proposal process into the base network by sharing its convolutional layers. During training, dense anchors are pre-located first, and RPN applies object/not-object classification and class-agnostic regression to them. At inference, it generates a sparse set of proposals for the second stage by applying coordinate refinement and non-maximum suppression (NMS) to the dense anchors. RPN enables end-to-end training of two-stage detectors and has become one of their standard components.

It is worth noting that not all the pre-located anchors are used for training because of the class imbalance problem: the majority of the dense anchors are easy samples of the not-object class. If all these anchors were taken into account, they would overwhelm the detector during training. In this paper, we focus on this class imbalance problem in RPN.

Class Imbalance. Like RPN in two-stage detectors, one-stage detectors also encounter class imbalance during training [3, 11, 12, 13], and example sampling strategies are the commonly employed solutions [11, 14, 23]. In contrast, Lin et al. [3] propose a novel loss function, called focal loss, which down-weights the losses of easy samples so that all samples can be included for training while the class imbalance is handled. Inspired by this work, we adapt focal loss to RPN so that the sampling process can also be avoided during the training of RPN.

Loss Function Design. There are two tasks, classification (cls) and bounding box regression (reg), in both the first and the second stage of two-stage methods, which classify the anchors/proposals into specific classes and regress the bounding boxes, respectively. The cls loss is the standard cross entropy loss, which for binary cls is written as:

$$\begin{aligned} CE\left( p,y\right)&=\frac{1}{N_{cls}}\sum _{i}CE\left( p_{i},y_{i}\right) \nonumber \\&=-\frac{1}{N_{cls}}\sum _{i}\left[ y_{i}\log (p_{i})+\left( 1-y_{i}\right) \log \left( 1-p_{i}\right) \right] \end{aligned}$$
(1)

where \(y_{i}\) is the label, \(p_{i}\) is the estimated probability for each sample, and \(N_{cls}\) is the number of samples, used as the normalization term. For the multi-class cls task, the cross entropy loss extends straightforwardly.
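As a minimal sketch of Eq. (1), the binary cross entropy can be computed in plain Python as follows. The function name `binary_cross_entropy` and the clipping constant `eps` are illustrative assumptions and are not taken from the original implementation; the loss is written with the conventional leading minus sign so that it is non-negative.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross entropy over N_cls samples (Eq. 1).

    probs  : estimated object probabilities p_i in [0, 1]
    labels : labels y_i, 1 for object and 0 for not-object
    eps    : hypothetical clipping constant that keeps log() finite
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)  # N_cls is the number of samples

loss = binary_cross_entropy([0.9, 0.2, 0.6], [1, 0, 1])
```

Only the term that matches the label contributes for each sample, so a confident correct prediction adds almost nothing to the sum.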

For the reg task, the smooth \(L_{1}\) loss [7] is applied:

$$\begin{aligned} smooth_{L_{1}}\left( x\right) ={\left\{ \begin{array}{ll} 0.5x^{2} &{} \text {if }\left| x\right| \le 1\\ \left| x\right| -0.5 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

where x is the difference between the anchors/proposals and the ground-truth bounding boxes.
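The piecewise definition in Eq. (2) can be sketched as below. The helper `smooth_l1_box`, which sums the loss over the four box coordinates, is a hypothetical convenience function added for illustration.

```python
def smooth_l1(x):
    """Smooth L1 loss of Eq. (2) for a scalar regression residual x."""
    ax = abs(x)
    if ax <= 1.0:
        return 0.5 * x * x   # quadratic near zero
    return ax - 0.5          # linear for large residuals

def smooth_l1_box(pred, target):
    """Sum the loss over box coordinates (hypothetical helper)."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))

values = [smooth_l1(x) for x in (-2.0, -0.5, 0.0, 1.0, 3.0)]
```

The two branches meet at \(|x|=1\), where both give 0.5, so the loss is continuous and grows only linearly for large residuals.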

We note that the cls task is object/not-object binary classification in the first stage, i.e., RPN, while in the second stage it is a multi-class task that classifies proposals into foreground classes or background. For the reg tasks in both stages, the smooth \(L_{1}\) loss is computed only on anchors/proposals that belong to the object/foreground classes.

We follow the literature and use these losses in our model, except that we use focal loss instead of cross entropy loss in the cls task of RPN, so that many more anchors can be included for training.

3 Focal Loss for RPN

As a Region Proposal Network (RPN) based detection model, Faster R-CNN [1] is taken as our base model for evaluating the adaptation of focal loss to RPN. In the rest of this section, we briefly review RPN in Faster R-CNN (Sect. 3.1) and focal loss [3] as applied in detection models (Sect. 3.2), and finally introduce our focal loss equipped RPN (Sect. 3.3).

Fig. 1. The training process of Faster R-CNN with focal loss. The blue/dashed lines indicate the generation/feeding of anchors or bounding boxes. (Color figure online)

3.1 RPN in Faster R-CNN

Faster R-CNN is a widely used two-stage detection model that integrates RPN to generate proposal regions, enabling an end-to-end detection model. Building on RPN, two-stage detection approaches have developed rapidly and achieved good performance in recent years [1, 5, 8, 25, 26].

As shown in Fig. 1, RPN shares convolutional (conv) layers with the base detection network, e.g., the first 5 conv layers in the Zeiler and Fergus model (ZF net) [20], 13 in VGG16 [21] and the first 4 blocks in ResNet [22]. On top of these shared conv layers, RPN is attached as an additional branch for cls and reg, consisting of a \(3\times 3\) conv layer followed by two sibling fully-connected layers (or \(1\times 1\) conv layers) for cls and reg, respectively. Note that RPN only classifies each anchor as object/not-object; for our focal loss adaptation we consider both a sigmoid output (1 for object and 0 for not-object) and a softmax output (as is usual in two-stage detectors), which will be introduced in Sect. 3.3. In addition, RPN regresses bounding boxes by refining pre-fixed anchors, which are centered at each position of the top shared conv layer. At each position, k anchors are taken according to different scales and aspect ratios, e.g., 3 scales and 3 aspect ratios result in \(k=9\) anchors in [1]. Therefore, with a typical image scale of \(\sim 600\times 1000\) and a feature stride of 16 for the shared conv layers [20, 21, 22], \(\sim 20,000\) anchors are obtained in total, among which the numbers of object and not-object anchors are highly imbalanced, e.g., \(\sim 1:1000\). For this reason, only a fixed number of anchors are sampled for training to ensure relatively balanced samples (in [1], 256 anchors with ratio 1 : 1). The loss function of RPN is formulated as:

$$\begin{aligned} L_{RPN}=\frac{1}{N_{cls}}\sum _{i}CE\left( p_{i},p_{i}^{t}\right) +\frac{1}{N_{reg}}\sum _{i}I\left( t_{i}^{t}\right) L_{reg}\left( t_{i},t_{i}^{t}\right) \end{aligned}$$
(3)

where \(N_{cls}\) and \(N_{reg}\) are the normalization terms, e.g., 256 in [1], and \(p_{i}^{t}\) and \(t_{i}^{t}\) are the cls label and reg target, respectively. The first term of Eq. (3) is the standard cross entropy loss, while the second is the reg loss, for which the standard smooth \(L_{1}\) loss [7] is applied; \(I\left( t_{i}^{t}\right) \) is an indicator function that restricts the reg loss to anchors labeled as object. The loss here is computed only on the sampled anchors.
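The anchor count and the sampling step described above can be illustrated with a short sketch. It assumes that the feature map size is the image size divided by the stride and rounded up (the exact size depends on the padding of the backbone), and it ignores the handling of boundary-crossing and unlabeled anchors found in real implementations. The names `count_anchors` and `sample_anchors` are hypothetical.

```python
import math
import random

def count_anchors(height, width, stride=16, k=9):
    """Number of anchors on the top shared conv feature map."""
    feat_h = math.ceil(height / stride)   # assumed rounding
    feat_w = math.ceil(width / stride)
    return feat_h * feat_w * k

def sample_anchors(labels, batch_size=256, pos_fraction=0.5, seed=0):
    """Pick anchor indices with at most batch_size * pos_fraction objects.

    labels: 1 for object, 0 for not-object. When there are too few
    object anchors, the batch is filled with not-object anchors.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    num_pos = min(len(pos), int(batch_size * pos_fraction))
    num_neg = min(len(neg), batch_size - num_pos)
    return rng.sample(pos, num_pos) + rng.sample(neg, num_neg)

total = count_anchors(600, 1000)           # 38 * 63 * 9 anchors
labels = [1] * 20 + [0] * (total - 20)     # roughly 1:1000 object/not-object
batch = sample_anchors(labels)
```

In this toy setting only 256 of more than 20,000 anchors contribute to the loss, which is the loss of diversity that motivates training on all anchors instead.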

3.2 Focal Loss for Detection

Unlike two-stage methods, one-stage detection models [3, 11, 12, 13] do not generate proposals first, but directly classify the anchors (or priors) and regress them to the ground-truth bounding boxes, as RPN does. The detection results are obtained in a single pass, which makes these models faster at detection time. However, they suffer from the same sample imbalance problem as RPN, and example sampling strategies [11, 14, 23] are the commonly applied solutions. In [3], all pre-located anchors are used for training instead of a relatively small number of sampled ones. The authors claim that the effect of the imbalance is that the accumulated loss from the vast number of easy samples overwhelms the detector [3]. Therefore, to address this imbalance problem, they propose focal loss to down-weight the loss of the easy samples. Focal loss is a dynamically scaled cross entropy loss, which can be formulated as:

$$\begin{aligned} FL(p_{t})=-\alpha _{t}(1-p_{t})^{\gamma }\log (p_{t}) \end{aligned}$$
(4)

where, for binary classification, \(p_{t}\in [0, 1]\) is the probability of the ground-truth class, \(\alpha _{t}\in [0, 1]\) is the re-weighting factor that balances positive and negative samples, and \(\gamma \ge 0\) is a hyper-parameter. Note that when \(\alpha _{t}=0.5,\gamma =0\), focal loss reduces to the standard cross entropy loss up to a constant factor of 0.5.

As Eq. (4) shows, for easy samples (\(p_{t}\) close to 1), the scale term \((1-p_{t})^{\gamma }\) greatly down-weights the loss, which leads the model to focus more on hard samples. With this dynamically scaled loss, the model is no longer overwhelmed by the far more numerous easy samples, so all the anchors can be included for training.
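The down-weighting effect of Eq. (4) can be checked numerically with a minimal sketch. The function names and the clipping constant `eps` are assumptions made for illustration; `cross_entropy` is obtained here as the special case \(\alpha _{t}=1,\gamma =0\).

```python
import math

def focal_loss(p_t, alpha_t=0.25, gamma=2.0, eps=1e-12):
    """Focal loss of Eq. (4) for one sample.

    p_t     : predicted probability of the ground-truth class
    alpha_t : re-weighting factor for this sample's class
    gamma   : focusing exponent; gamma = 0 removes the modulation
    eps     : hypothetical clipping constant that keeps log() finite
    """
    p_t = min(max(p_t, eps), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p_t):
    """Standard cross entropy, i.e. focal loss with alpha_t = 1, gamma = 0."""
    return focal_loss(p_t, alpha_t=1.0, gamma=0.0)

easy, hard = 0.99, 0.10
# Ratio of hard-sample loss to easy-sample loss under each criterion.
ce_ratio = cross_entropy(hard) / cross_entropy(easy)
fl_ratio = focal_loss(hard) / focal_loss(easy)
```

With \(\gamma =2\), the modulating factor widens the gap between this hard and this easy sample by \((0.9/0.01)^{2}=8100\) relative to cross entropy, which is why a large number of easy anchors no longer dominates the summed loss.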

3.3 Focal Loss for RPN

To investigate the application of focal loss to RPN, we re-formulate the loss of RPN with focal loss as:

$$\begin{aligned} L_{RPN-FL}=\frac{\lambda _{fl}}{N_{cls}^{'}}\sum _{i}FL\left( p_{i}^{t}\right) +\frac{1}{N_{reg}^{'}}\sum _{i}I\left( t_{i}^{t}\right) L_{reg}\left( t_{i},t_{i}^{t}\right) \end{aligned}$$
(5)

where we simply replace the cross entropy loss with focal loss and use all anchors (\(\sim 20,000\) per image) for training instead of the sampled ones, and \(\lambda _{fl}\) serves as a balancing weight. Note that in the first term of Eq. (5) we set \(N_{cls}^{'}=|\{i:p_{i}^{t}\in object\}|\), which means that the cls loss is normalized by the number of object samples in this dense anchor scenario, while in the second term we set \(N_{reg}^{'}=2*|\{i:p_{i}^{t}\in object\}|\).
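The classification term of Eq. (5) and its normalization can be sketched as follows. Two points are assumptions rather than statements from the text: the class weight is taken as \(\alpha \) for object anchors and \(1-\alpha \) for not-object anchors, following the convention of [3], and the normalizers are clamped to at least 1 to guard against images without object anchors. The function names are hypothetical.

```python
import math

def rpn_focal_cls_loss(probs, labels, alpha=0.25, gamma=2.0, lambda_fl=0.1):
    """Classification term of Eq. (5) computed over all anchors.

    probs  : predicted object probability for every anchor
    labels : 1 for object, 0 for not-object
    The sum is divided by the number of object anchors (N'_cls) and
    scaled by the balancing weight lambda_fl.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p              # prob. of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha  # assumed convention
        p_t = min(max(p_t, 1e-12), 1.0)
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    n_cls = max(1, sum(labels))  # assumed guard for images without objects
    return lambda_fl * total / n_cls

def reg_normalizer(labels):
    """N'_reg = 2 * number of object anchors, as stated for Eq. (5)."""
    return 2 * max(1, sum(labels))

# Toy image: 2 object anchors and 1000 easy not-object anchors.
probs = [0.7, 0.4] + [0.01] * 1000
labels = [1, 1] + [0] * 1000
loss = rpn_focal_cls_loss(probs, labels)
```

In this toy example the 1000 easy not-object anchors together contribute less than one percent of the summed loss, so normalizing by the number of object anchors rather than by the total anchor count keeps the loss at a usable scale.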

Figure 1 illustrates our adaptation of focal loss to RPN. In contrast to training with only a subset of the anchors as in previous works [1, 8, 25, 26], all the generated dense anchors are used for training with our adapted focal loss. The focal loss equipped RPN is integrated into the Faster R-CNN framework [1, 2] in the following form:

$$\begin{aligned} L=L_{RPN-FL}+L_{RCNN} \end{aligned}$$
(6)

where \(L_{RCNN}\) includes the multi-class cls loss and the class-aware reg loss. We leave it unmodified in order to isolate the effect of applying focal loss in RPN on the whole detection system.

To obtain the probability \(p_{t}\) in Eq. (4), we consider two output functions, softmax and sigmoid. With softmax, we obtain two scores \(\left[ p_{p},p_{n}\right] \) for object and not-object, respectively, and set \(p_{t}=p_{p}\) if the anchor is matched with the object label and \(p_{t}=p_{n}\) if it is matched with the not-object label. With sigmoid, only one score \(p_{s}\) is obtained, and \(p_{t}=p_{s}\) for the object label, while \(p_{t}=1-p_{s}\) for the not-object label. These two output functions are compared in the following experiments.
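The two ways of obtaining \(p_{t}\) can be sketched as below, operating on raw scores (logits). The function names and the use of logits as inputs are assumptions for illustration, and the plain sigmoid shown here is not guarded against overflow for extremely negative inputs.

```python
import math

def p_t_softmax(logit_obj, logit_bg, label):
    """p_t from a two-way softmax over [object, not-object] logits."""
    m = max(logit_obj, logit_bg)            # subtract max for stability
    e_obj = math.exp(logit_obj - m)
    e_bg = math.exp(logit_bg - m)
    p_p = e_obj / (e_obj + e_bg)
    p_n = 1.0 - p_p
    return p_p if label == 1 else p_n

def p_t_sigmoid(logit, label):
    """p_t from a single sigmoid score p_s."""
    p_s = 1.0 / (1.0 + math.exp(-logit))
    return p_s if label == 1 else 1.0 - p_s

soft = p_t_softmax(2.0, 0.5, label=1)
sig = p_t_sigmoid(1.5, label=1)
```

A two-way softmax gives the same probability as a sigmoid applied to the difference of the two logits, so the two variants differ in parameterization (two output channels per anchor versus one) rather than in the family of probabilities they can express.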

Implementation Details. This work is based on the public TensorFlow implementation of Faster R-CNN [2], and we follow most of the parameter settings of the original implementation. We use stochastic gradient descent (SGD) for optimization, with momentum 0.9 and weight decay 0.0001. The model is trained with one image per iteration following [2], and the only data augmentation strategy is random flipping of the training images. An ImageNet [9] pre-trained VGG16 [21] is used as our base network, and the conv1 and conv2 layers are fixed.

We set the base learning rate to 0.001 for the first 50k/350k iterations and decrease it by a factor of 10 for the next 20k/140k iterations on the PASCAL VOC 2007/COCO datasets, respectively. For the hyper-parameters, we set \(\alpha _{t}=0.25,\gamma =2\) and \(\lambda _{fl}=0.1\) by default; they are evaluated in the following experiments.
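The step schedule described above can be written as a small lookup. This is a sketch under the reading that the rate is divided by 10 once, after 50k (VOC) or 350k (COCO) iterations; the function name and the dataset keys are hypothetical.

```python
def learning_rate(iteration, dataset="voc", base_lr=0.001):
    """Step schedule: base_lr first, then base_lr / 10 until training ends."""
    # (iteration at which the rate drops, total number of iterations)
    steps = {"voc": (50_000, 70_000), "coco": (350_000, 490_000)}
    drop_at, total = steps[dataset]
    if not 0 <= iteration < total:
        raise ValueError("iteration outside the training schedule")
    return base_lr if iteration < drop_at else base_lr / 10.0

rates = [learning_rate(i) for i in (0, 49_999, 50_000, 69_999)]
```

The totals of 70k and 490k iterations follow from adding the two phases (50k + 20k and 350k + 140k).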

4 Experiments

We evaluate our model on the PASCAL VOC 2007 [24] and COCO [10] detection benchmarks and follow the standard data splits. Average precision (AP) is reported following the literature. An image scale of 600 pixels is used for both training and testing [1, 2]. Note that, for a fair comparison, we only modify the loss function and do not add any parameters in any of our experiments; the only exception is the model with sigmoid output, which contains fewer parameters because the output is reduced from two scores to one.

PASCAL VOC 2007. PASCAL VOC [24] is a classical dataset for computer vision tasks, e.g., classification, detection and segmentation, and we use it to evaluate our model in the following experiments. It contains 20 object categories for detection, with 2.47 objects per image on average. We use the trainval split for training and the test split for evaluation, which consist of 5,011 and 4,952 images, respectively. Average precision (AP) is reported with the IOU threshold set to 0.5.

COCO 2014. As a more complicated dataset, COCO [10] is a challenging object detection benchmark and is the most widely used for evaluating detection models. It contains 80 object categories for detection, with 7.58 objects per image on average. We use COCO 2014 in our experiments, which consists of 82,783 images for training, 40,504 for validation and 40,775 for testing. Because the ground truth of the test split is unavailable, we follow the literature [2, 10] and re-split the dataset into train+valminusminival and minival. For evaluation, COCO employs a stricter metric, in which average precision (AP) is computed at different IOU thresholds, i.e., \(\left[ 0.5:0.95\right] \), and the average over these thresholds is reported. In addition, the performance for different object scales, i.e., small/medium/large, is also reported.
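To make the stricter COCO-style criterion concrete, the sketch below computes the IOU of two boxes and lists the thresholds from 0.50 to 0.95 in steps of 0.05 at which a detection would still count as a match. Boxes are assumed to be given as continuous corner coordinates (x1, y1, x2, y2); the example boxes are made up for illustration, and the full AP computation is not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Ten thresholds 0.50, 0.55, ..., 0.95 used by the COCO-style metric.
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Hypothetical detection shifted by 2 pixels against a 10 x 10 ground truth.
overlap = iou((0, 0, 10, 10), (2, 0, 12, 10))
matched_at = [t for t in thresholds if overlap >= t]
```

A detection that would be accepted under the single PASCAL VOC threshold of 0.5 is rejected at the higher thresholds unless it is well localized, which is why the averaged metric is harder to improve.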

Table 1. Parameter evaluation. Detection average precision (%). All models use Faster R-CNN with VGG16. For each column, we change only the corresponding parameter and keep the others at their defaults. Missing values mean that the model failed to train under those settings.

4.1 Parameters Evaluation

We evaluate the hyper-parameters in Table 1. FL-softmax and FL-sigmoid denote Faster R-CNN with a focal loss equipped RPN whose output uses softmax and sigmoid, respectively, as introduced in Sect. 3.3. As the table shows, FL-softmax always achieves higher performance than FL-sigmoid, while the latter is much more stable under different parameter settings. We conjecture that the saturation of the sigmoid function makes the model less sensitive to the hyper-parameters but also stalls the optimization process, which results in inferior performance. For the several failed settings of FL-softmax, the large scale of the focal loss computed on all anchors may be the cause, e.g., a small exponent \(\gamma \le 1\) or a large loss weight \(\lambda _{fl}\ge 0.2\) could cause the loss to explode and further hurt the optimization process. Thus, for FL-softmax, the hyper-parameters should be chosen more carefully to keep the computed focal loss at a reasonable scale, so that the model trains correctly.

Table 2. Detection average precision (%) on the VOC 2007 test split. These models use the default hyper-parameters, except that \(\gamma =1\) in FL-sigmoid. In baseline+FL, we combine focal loss with the original RPN. *Baseline trained by us using the public implementation.

4.2 Performance Comparison

Table 2 shows the detection results of the baseline and of our models adapted with focal loss. The performance is comparable to the baseline, implying that focal loss can be applied in RPN directly to replace the sampling mechanism, with only a minor impact on performance.

As Table 2 shows, however, although the mAP changes only slightly (0.1% lower for FL-sigmoid and 0.3% higher for FL-softmax), the performance on individual classes changes noticeably, e.g., 2.9% lower for the ‘table’ class and 4.1% higher for the ‘cat’ class with FL-softmax, which may indicate that focal loss is complementary to the standard cross entropy loss. Inspired by this, we simply add focal loss to the original RPN, denoted as baseline+FL in Table 2, which obtains the same performance as FL-softmax and thus also a minor improvement over the baseline. Specifically, in baseline+FL, focal loss is computed on all anchors as before while cross entropy loss is computed on the sampled anchors, and the two losses are directly combined by averaging. Figure 3 displays some examples from PASCAL VOC 2007 detected by the baseline+FL model, where satisfactory results are obtained over a wide range of scales and aspect ratios.
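One possible reading of the baseline+FL combination is sketched below. It is an assumption, not the confirmed implementation: "combined by average" is interpreted as the arithmetic mean of the two classification terms, the focal term keeps the default weight \(\lambda _{fl}\) and the normalization by the number of object anchors, and the class weight follows the convention of [3]. All function names are hypothetical.

```python
import math

def ce_term(probs, labels, sampled_idx):
    """Cross entropy averaged over the sampled anchors only."""
    total = 0.0
    for i in sampled_idx:
        p_t = probs[i] if labels[i] == 1 else 1.0 - probs[i]
        total += -math.log(max(p_t, 1e-12))
    return total / max(1, len(sampled_idx))

def fl_term(probs, labels, alpha=0.25, gamma=2.0, lambda_fl=0.1):
    """Focal loss over all anchors, normalized by the object count."""
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = max(p if y == 1 else 1.0 - p, 1e-12)
        alpha_t = alpha if y == 1 else 1.0 - alpha  # assumed convention
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return lambda_fl * total / max(1, sum(labels))

def combined_cls_loss(probs, labels, sampled_idx):
    """Assumed reading of 'combined by average': mean of the two terms."""
    return 0.5 * (ce_term(probs, labels, sampled_idx) + fl_term(probs, labels))

probs = [0.8, 0.3, 0.1, 0.05]     # predicted object probabilities
labels = [1, 0, 0, 0]
sampled = [0, 1]                  # indices chosen by the 1:1 sampler
loss = combined_cls_loss(probs, labels, sampled)
```

Under this reading, the sampled anchors receive gradient from both terms, while the remaining anchors are seen only through the down-weighted focal term.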

Fig. 2. Loss curves of RPN. Left: cls loss. Right: reg loss. The loss curves of baseline+FL are similar to those of the baseline, so we omit them for simplicity.

4.3 Training Process in RPN

To further analyze the influence of focal loss on RPN, we plot the cls and reg losses during training in Fig. 2. In the cls loss curve, the two focal loss equipped RPNs converge much faster and more stably than the baseline. This benefit comes from the intrinsic ability of focal loss to train with many more anchors. In the reg loss curve, however, these two models perform worse than the baseline: their losses are much less stable and larger in scale. This may be the reason why focal loss cannot boost RPN (and Faster R-CNN) as much as it boosts the one-stage detection model [3]; that is, even after satisfactory scores are obtained for all the anchors, the anchors cannot be refined well enough to produce satisfactory proposals for R-CNN, which may affect the performance of the whole detection system. This may imply that the training signals produced by focal loss conflict to some extent with those from bounding box regression.

It is also worth noting that RetinaNet [3], the network that first applied focal loss to detection, decouples the cls and reg tasks into two sub-networks and thus avoids this problem of conflicting signals. In this work, we follow the original design of RPN, in which the two tasks share the same network except for the task-specific layers. Thus, decoupling the cls and reg tasks as in RetinaNet may further improve the performance of our focal loss equipped RPN. Other ways of making focal loss more compatible with bounding box regression can also be considered, and we leave this as future work.

Table 3. Object detection average precision (%) on COCO 2014 minival. Legend same as Table 2. *Baseline trained by us using the public implementation.

Fig. 3. Examples from PASCAL VOC 2007 detected by baseline+FL. The score threshold for display is set to 0.5.

4.4 More Results on COCO

We also conduct experiments with the focal loss equipped RPN on the COCO 2014 dataset [10], where we use the train+valminusminival and minival splits following [2, 10]. As Table 3 shows, FL-softmax performs comparably to the baseline, while baseline+FL is superior on all the metrics. Regarding the different behavior of baseline+FL on the two datasets, we conjecture that the difference in dataset statistics matters: COCO contains many more objects per image than PASCAL VOC 2007 (7.58 vs 2.47 on average), which may lead to differences in the training process, i.e., the anchors used for computing focal loss are less imbalanced in COCO than in PASCAL VOC 2007. That is, on datasets with dense objects, such as COCO, focal loss combined with the standard cross entropy loss may work better than either of them alone.

On the other hand, the original implementation [2] reports that the performance on COCO could continue to improve with more training iterations, e.g., 900k/1190k; thus, the reason why baseline+FL performs better than the baseline and FL-softmax may be its faster convergence. Whether the difference in dataset statistics or the convergence behavior accounts for the performance difference remains to be explored.

We note that the training processes display the same trends as in Sect. 4.3. These experimental results show that focal loss can also be adapted to more complicated datasets.

5 Conclusion

In this work, we investigate how to adapt focal loss to train RPN without applying a sampling strategy. By down-weighting the losses of the vast number of easy samples, focal loss intrinsically handles the class imbalance problem and prevents these losses from overwhelming the detector, which makes it possible to include many more samples for training. Thus, RPN can take all anchors into account during training, either by replacing the standard cross entropy loss with focal loss or by simply combining the two. As the experiments on PASCAL VOC 2007 and COCO show, it is feasible to train RPN without a specially designed sampling scheme. We also discuss the compatibility between focal loss and bounding box regression in RPN, which is left as future work.