Focal Loss for Region Proposal Network

Chen, Chengpeng; Song, Xinhang; Jiang, Shuqiang

doi:10.1007/978-3-030-03335-4_32

Conference paper
First Online: 02 November 2018

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11257)

Abstract

Currently, most state-of-the-art object detection models are based on the two-stage scheme pioneered by R-CNN and integrate a region proposal network (RPN), which serves as the proposal generator. During the training of RPN, only a fixed number of samples with a fixed object/not-object ratio are sampled to avoid the class imbalance problem. In contrast to such sampling strategies, focal loss, which was introduced for one-stage detection methods that encounter the same problem, addresses class imbalance by down-weighting the losses of the vast number of easy samples. Inspired by this, we investigate the adaptation of focal loss to RPN in this paper, which allows us to train RPN free of the sampling process. Based on Faster R-CNN, we adapt focal loss to RPN, and the experimental results on the PASCAL VOC 2007 and COCO datasets outperform the baseline, which shows the effectiveness of the proposed method and implies that focal loss can be applied to RPN directly.

1 Introduction

In this era of deep learning, most object detection models with state-of-the-art performance are based on a two-stage scheme [1, 5, 7, 8, 25, 26], where a sparse set of proposals is generated at the first stage, followed by regional object classification and coordinate regression at the second stage. Proposal generation has developed from off-line methods, such as Selective Search [15] and objectness [16], to integrated learning ones [1, 18, 19], among which the Region Proposal Network (RPN) has become a standard component of state-of-the-art two-stage methods. During the training of RPN, candidate proposals are first sampled among pre-located dense anchors and then fed to the object/not-object classifier and the regressor. Within those dense anchors, the object/not-object classes are highly imbalanced: not-object samples far outnumber object samples, which makes it difficult to train a classifier with regular policies. Thus, as a usual strategy, only a fixed number of anchors with a fixed object/not-object ratio, e.g., 256 and 1:1 [1], are sampled for training. Although such a constraint on the sampling process can balance the samples, it also reduces the diversity of proposals. Instead of constraining the sampling process, we investigate this imbalance problem from the perspective of designing a suitable loss function for training.

The class imbalance problem is also encountered in one-stage detection models [3, 11, 12, 13], for which different example sampling strategies [11, 14, 23] have been proposed. However, Lin et al. [3] claim that it is the vast number of easy samples that overwhelms the detector. Thus, they propose to take all pre-located dense anchors for training with a dynamically scaled cross entropy loss, called focal loss, which prevents these easy samples from overwhelming the training process by down-weighting their losses.

In this paper, we investigate the adaptation of focal loss to RPN (see Sect. 3.3), such that many more samples can be included for training without the training problems caused by class imbalance. By replacing the standard cross entropy loss in RPN with focal loss, RPN can be trained directly with no need for specially designed sampling strategies. Besides, because RPN is implemented fully convolutionally, no extra computation cost is required. We take Faster R-CNN [1, 2] as our baseline model and conduct experiments on the PASCAL VOC 2007 [24] and COCO [10] detection benchmarks. The experimental results show the effectiveness of the proposed method, implying that this sampling-free strategy can be directly applied to RPN and therefore to the state-of-the-art two-stage detectors that use it.

2 Related Work

Two-Stage Detectors. With the fast development of deep learning [9] over the past few years, two-stage object detectors [1, 4, 5, 6, 7, 8, 25, 26] have become one of the dominant families of object detection methods. In the two-stage methods, a sparse set of candidate proposals with high probabilities of containing objects is first generated [1, 15, 18, 19], followed by a second stage of object classification and coordinate regression. Empowered by deep neural networks [9, 20, 21, 22] and a series of improvements in both speed and accuracy [1, 4, 6, 7], the whole detection system has been integrated into a single network, i.e., the widely used Faster R-CNN [1] framework. Many works have extended this framework [5, 8, 25, 26]. We also use Faster R-CNN as our base model to investigate the adaptation of focal loss to RPN in this paper.

Region Proposal Methods. As the first stage in the two-stage scheme, region proposal methods have developed from pioneering off-line methods, such as Selective Search [15] and objectness [16], to integrated learning ones [1, 18, 19], among which RPN integrates the proposal process into the base network by sharing its convolutional layers. During training, dense anchors are pre-located first, and RPN applies object/not-object classification and class-agnostic regression to them. At inference, RPN generates a sparse set of proposals for the second stage by applying coordinate refinements and non-maximum suppression (NMS) to the dense anchors. RPN enables the end-to-end training of two-stage detectors and has become one of their standard components.

It is worth noting that not all the pre-located anchors are employed for training because of the class imbalance problem, that is, the majority of the dense anchors are easy samples of class not-object. If all these anchors were taken into account, they would overwhelm the detector during training. In this paper, we focus on this class imbalance problem in RPN.

Class Imbalance. Like RPN in two-stage detectors, one-stage detectors also encounter class imbalance during training [3, 11, 12, 13], and example sampling strategies are the solutions most often employed [11, 14, 23]. In contrast, Lin et al. [3] propose a novel loss function, called focal loss, that down-weights the losses of easy samples, so that all samples can be included for training while the class imbalance is still handled. Inspired by this work, we adapt focal loss to RPN so that we can also avoid the sampling process during the training of RPN.

Loss Function Design. There are two tasks, classification (cls) and bounding box regression (reg), in both the first and the second stage of these two-stage methods, which classify the anchors/proposals into a specific class and regress the bounding boxes, respectively. The cls loss is the standard cross entropy loss, which for binary cls is:

$$\begin{aligned} CE\left( p,y\right)&=\frac{1}{N_{cls}}\sum _{i}CE\left( p_{i},y_{i}\right) \nonumber \\&=-\frac{1}{N_{cls}}\sum _{i}\left[ y_{i}\log (p_{i})+\left( 1-y_{i}\right) \log \left( 1-p_{i}\right) \right] \end{aligned}$$

(1)

where $y_{i}$ is the label, $p_{i}$ is the estimated probability for each sample, and $N_{cls}$ is the number of samples, which serves as a normalization term. For the multi-class cls task, the cross entropy loss can be extended straightforwardly.
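As a minimal sketch of Eq. (1), the binary cross entropy can be written in plain Python as below. The function name `binary_cross_entropy` and the clamping constant `eps` are illustrative assumptions and are not taken from the implementation used in this paper.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross entropy over N_cls samples, following Eq. (1).

    probs:  estimated object probabilities p_i in [0, 1]
    labels: labels y_i, 1 for object and 0 for not-object
    eps:    hypothetical clamping constant that avoids log(0)
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equally long")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)  # N_cls is the number of samples
```

For a maximally uncertain prediction ($p_{i}=0.5$ for every sample) the sketch returns $\log 2$ regardless of the labels.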

For the reg task, the smooth $L_{1}$ loss [7] is applied:

$$\begin{aligned} smooth_{L_{1}}\left( x\right) ={\left\{ \begin{array}{ll} 0.5x^{2} &{} \text {if } \left| x\right| \le 1\\ \left| x\right| -0.5 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

(2)

where x is the difference between the anchors/proposals and the ground-truth bounding boxes.
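The piecewise definition in Eq. (2) can be sketched for a single scalar residual as follows; the function name `smooth_l1` is an illustrative assumption.

```python
def smooth_l1(x):
    """Smooth L1 loss of Eq. (2) for one scalar regression residual x."""
    abs_x = abs(x)
    if abs_x <= 1.0:
        return 0.5 * x * x   # quadratic near zero
    return abs_x - 0.5       # linear for large residuals
```

The two branches agree at $|x|=1$ (both give 0.5), so the loss is continuous there, and large residuals contribute only linearly.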

We note that the cls task is object/not-object binary classification in the first stage, i.e., RPN, while in the second stage it is multi-class classification over the foreground classes and background. For the reg tasks in both stages, the smooth $L_{1}$ loss is only computed on anchors/proposals belonging to the object/foreground classes.

We follow the literature and use these losses in our model, except that we use focal loss instead of cross entropy loss in the cls task of RPN, so that we can include many more anchors for training.

3 Focal Loss for RPN

As a Region Proposal Network (RPN) based detection model, Faster R-CNN [1] is taken as our base model for evaluating the adaptation of focal loss to RPN. In the rest of this section, we briefly review RPN in Faster R-CNN (Sect. 3.1) and focal loss [3] as applied in detection models (Sect. 3.2), and finally introduce our focal loss equipped RPN (Sect. 3.3).

3.1 RPN in Faster R-CNN

Faster R-CNN is a widely used two-stage detection model that integrates RPN to generate proposal regions, enabling an end-to-end detection model. Based on RPN, two-stage detection approaches have developed rapidly and achieved good performance in recent years [1, 5, 8, 25, 26].

As shown in Fig. 1, RPN shares convolutional (conv) layers with the base detection network, e.g., the first 5 conv layers in the Zeiler and Fergus model (ZF net) [20], 13 in VGG16 [21] and the first 4 blocks in ResNet [22]. On top of these shared conv layers, RPN is added as an extra branch for cls and reg, consisting of a $3\times 3$ conv layer followed by two sibling fully-connected layers (or $1\times 1$ conv layers) for cls and reg, respectively. Note that RPN only classifies object/not-object for each anchor; for our focal loss adaptation we apply both sigmoid (1 for object and 0 for not-object) and softmax (as usual in two-stage detectors) outputs, which will be introduced in Sect. 3.3. Besides, RPN regresses bounding boxes by refining pre-fixed anchors, which are centered at each position of the top shared conv layer. At each position, k anchors are taken according to different scales and aspect ratios, e.g., 3 scales and 3 aspect ratios result in $k=9$ anchors in [1]. Therefore, with a typical image scale of $\sim 600\times 1000$ and a feature stride of 16 for the shared conv layers [20, 21, 22], $\sim 20{,}000$ anchors are obtained in total, in which the numbers of object/not-object anchors are very imbalanced, e.g., $\sim 1:1000$. However, only a fixed number of anchors are sampled for training to ensure relatively balanced samples (in [1], 256 anchors with ratio 1 : 1). The loss function of RPN is formulated as:

$$\begin{aligned} L_{RPN}=\frac{1}{N_{cls}}\sum _{i}CE\left( p_{i},p_{i}^{t}\right) +\frac{1}{N_{reg}}\sum _{i}I\left( t_{i}^{t}\right) L_{reg}\left( t_{i},t_{i}^{t}\right) \end{aligned}$$

(3)

where $N_{cls}$ and $N_{reg}$ are the normalization terms, e.g., 256 in [1], and $p_{i}^{t}$ and $t_{i}^{t}$ are the cls label and reg target, respectively. The first term of Eq. (3) is the standard cross entropy loss, while the second is the reg loss, for which the standard smooth $L_{1}$ loss [7] is applied; $I\left( t_{i}^{t}\right) $ is an indicator function that restricts the reg loss to anchors labeled as object. The loss here is only computed on the sampled anchors.
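The anchor count and the balanced sampling described in this subsection can be sketched as follows. This is a rough illustration under stated assumptions: the feature map size is taken as the image size divided by the stride and rounded up, and the helper names `count_anchors` and `sample_anchors` are hypothetical rather than part of any Faster R-CNN code base.

```python
import math
import random


def count_anchors(height, width, stride=16, k=9):
    """Total anchors: k per position of the stride-16 feature map.

    Rounding the feature map size up is an assumption of this sketch.
    """
    return math.ceil(height / stride) * math.ceil(width / stride) * k


def sample_anchors(labels, batch_size=256, fg_fraction=0.5, rng=None):
    """Sample anchor indices with an object/not-object ratio of up to 1:1.

    labels: 1 (object) or 0 (not-object) for every anchor.
    If there are too few object anchors, the batch is padded with
    not-object anchors.
    """
    rng = rng or random.Random(0)
    fg = [i for i, y in enumerate(labels) if y == 1]
    bg = [i for i, y in enumerate(labels) if y == 0]
    n_fg = min(len(fg), int(batch_size * fg_fraction))
    n_bg = min(len(bg), batch_size - n_fg)
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)


# A 600 x 1000 image gives 38 * 63 * 9 anchors, i.e. roughly 20,000.
total_anchors = count_anchors(600, 1000)
```

With about 20 object anchors among roughly 20,000, the sampled batch contains all 20 object anchors and 236 not-object anchors, so the vast majority of anchors never contributes to the loss of that iteration.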

3.2 Focal Loss for Detection

Unlike two-stage methods, one-stage detection models [3, 11, 12, 13] do not generate proposals first, but, like RPN, directly classify the anchors (or priors) and regress them to the ground-truth bounding boxes. The detection results are obtained in a single pass, which makes these models faster at detection. However, they suffer from the same sample imbalance problem as RPN, and example sampling strategies [11, 14, 23] are the solutions most often applied. In [3], all pre-located anchors are used for training instead of a relatively small number of sampled ones. The authors claim that the effect of the imbalance problem is that the accumulated loss from the vast number of easy samples overwhelms the detector [3]. Therefore, to address this imbalance problem, they propose focal loss, which down-weights the loss of the easy samples. Focal loss is a dynamically scaled cross entropy loss, which can be formulated as:

$$\begin{aligned} FL(p_{t})=-\alpha _{t}(1-p_{t})^{\gamma }\log (p_{t}) \end{aligned}$$

(4)

where for binary classification, $p_{t}\in [0, 1]$ is the estimated probability for the ground-truth class, $\alpha _{t}\in [0, 1]$ is the re-weighting factor that balances positive and negative samples, and $\gamma \ge 0$ is a hyper-parameter. Note that when $\gamma =0$ and $\alpha _{t}=0.5$, focal loss reduces to the standard cross entropy loss up to the constant factor 0.5.

As Eq. (4) shows, for easy samples ($p_{t}$ close to 1), the scale term $(1-p_{t})^{\gamma }$ down-weights the loss greatly; thus, it leads the model to focus more on hard samples. With this dynamically scaled loss, the model is no longer overwhelmed by the far more numerous easy samples, so all the anchors can be included for training.
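The down-weighting effect of Eq. (4) can be illustrated with a short sketch. The function names and the clamping constant `eps` are assumptions made for illustration; the default values $\alpha _{t}=0.25$ and $\gamma =2$ are the ones used later in this paper.

```python
import math


def cross_entropy(p_t, eps=1e-12):
    """Standard cross entropy -log(p_t) for the ground-truth class."""
    return -math.log(min(max(p_t, eps), 1.0))


def focal_loss(p_t, alpha_t=0.25, gamma=2.0, eps=1e-12):
    """Focal loss of Eq. (4) for one sample with ground-truth probability p_t."""
    p_t = min(max(p_t, eps), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def scale_factor(p_t, alpha_t=0.25, gamma=2.0):
    """Ratio FL / CE, which equals alpha_t * (1 - p_t) ** gamma."""
    return focal_loss(p_t, alpha_t, gamma) / cross_entropy(p_t)
```

For an easy sample with $p_{t}=0.9$ the ratio is $0.25\times 0.1^{2}=0.0025$, whereas for a hard sample with $p_{t}=0.1$ it is $0.25\times 0.9^{2}=0.2025$, so the hard sample keeps roughly 80 times more of its cross entropy loss.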

3.3 Focal Loss for RPN

To investigate the application of focal loss to RPN, we re-formulate the loss of RPN with focal loss as:

$$\begin{aligned} L_{RPN-FL}=\frac{\lambda _{fl}}{N_{cls}^{'}}\sum _{i}FL\left( p_{i}^{t}\right) +\frac{1}{N_{reg}^{'}}\sum _{i}I\left( t_{i}^{t}\right) L_{reg}\left( t_{i},t_{i}^{t}\right) \end{aligned}$$

(5)

where we simply replace the cross entropy loss with focal loss and use all anchors ($\sim 20{,}000$ per image) for training instead of only the sampled ones. Here $FL$ is evaluated for each anchor $i$ on the probability of its ground-truth class as in Eq. (4), and $\lambda _{fl}$ serves as a balancing weight. Note that, in the first term of Eq. (5), we set $N_{cls}^{'}=|\{i: p_{i}^{t}\in object\}|$, which means that the cls loss is normalized by the number of object samples in this dense anchor scenario, while in the second term, we set $N_{reg}^{'}=2\times |\{i: p_{i}^{t}\in object\}|$.
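A minimal sketch of the cls term of Eq. (5) is given below. It assumes, following the convention of [3], that $\alpha _{t}=\alpha $ for object anchors and $\alpha _{t}=1-\alpha $ for not-object anchors; the function name, the `eps` clamp and the guard for images without object anchors are illustrative assumptions.

```python
import math


def rpn_focal_cls_loss(probs_object, labels, alpha=0.25, gamma=2.0,
                       lambda_fl=0.1, eps=1e-12):
    """Focal cls term of Eq. (5), computed over ALL anchors of an image.

    probs_object: predicted object probability for every anchor
    labels:       1 for object anchors, 0 for not-object anchors
    """
    total = 0.0
    num_object = 0
    for p, y in zip(probs_object, labels):
        if y == 1:
            p_t, alpha_t = p, alpha
            num_object += 1
        else:
            p_t, alpha_t = 1.0 - p, 1.0 - alpha  # assumed convention of [3]
        p_t = min(max(p_t, eps), 1.0)
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    n_cls = max(num_object, 1)  # N'_cls; the guard for 0 objects is an assumption
    return lambda_fl * total / n_cls
```

Normalizing by the number of object anchors rather than by all anchors keeps the loss scale from shrinking toward zero when thousands of easy not-object anchors, each with a tiny loss, are included.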

Figure 1 illustrates our adaptation of focal loss in RPN. In contrast to training with only a part of the anchors as in previous works [1, 8, 25, 26], all the generated dense anchors are used for training with our adapted focal loss. The focal loss equipped RPN is integrated into the Faster R-CNN framework [1, 2] in the following form:

$$\begin{aligned} L=L_{RPN-FL}+L_{RCNN} \end{aligned}$$

(6)

where $L_{RCNN}$ includes the multi-class cls loss and the class-aware reg loss. We leave it unmodified in order to isolate the effect that applying focal loss in RPN has on the whole detection system.

To obtain the probability $p_{t}$ in Eq. (4), we use two output functions, softmax and sigmoid. With softmax output, we get two scores $\left[ p_{p},p_{n}\right] $ for object and not-object, respectively, and set $p_{t}=p_{p}$ if the anchor has the object label and $p_{t}=p_{n}$ if it has the not-object label. With sigmoid output, only one score $p_{s}$ is obtained, and $p_{t}=p_{s}$ for the object label, while $p_{t}=1-p_{s}$ for the not-object label. These two output functions will be compared in the following experiments.
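The two ways of obtaining $p_{t}$ can be sketched as below, starting from raw network scores (logits). The function names and the use of logits as inputs are assumptions made for illustration.

```python
import math


def p_t_softmax(logit_object, logit_not_object, is_object):
    """p_t from a two-way softmax over [object, not-object] logits."""
    m = max(logit_object, logit_not_object)  # subtract max for stability
    e_p = math.exp(logit_object - m)
    e_n = math.exp(logit_not_object - m)
    p_p = e_p / (e_p + e_n)
    p_n = e_n / (e_p + e_n)
    return p_p if is_object else p_n


def p_t_sigmoid(logit, is_object):
    """p_t from a single sigmoid score p_s."""
    if logit >= 0:
        p_s = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        p_s = e / (1.0 + e)
    return p_s if is_object else 1.0 - p_s
```

Mathematically, a two-way softmax equals a sigmoid applied to the difference of the two logits, so the two variants differ in how the scores are parameterized (two output channels versus one) rather than in the form of $p_{t}$ itself.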

Implementation Details. This work is based on the public TensorFlow implementation of Faster R-CNN [2] (see Note 1), and we follow most of the parameter settings of the original implementation. We use stochastic gradient descent (SGD) for optimization with momentum 0.9 and weight decay 0.0001. The model is trained with one image per iteration following [2], and the only data augmentation strategy is random flipping of the training images. An ImageNet [9] pre-trained VGG16 [21] is used as our base network, and the conv1 and conv2 layers are fixed.

We set the base learning rate to 0.001 for the first 50k/350k iterations and decrease it by a factor of 10 for the next 20k/140k iterations on the PASCAL VOC 2007/COCO datasets. For the hyper-parameters, we set $\alpha _{t}=0.25$, $\gamma =2$ and $\lambda _{fl}=0.1$ by default, and they will be evaluated in the following experiments.
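The step schedule described above can be written as a small helper. This is only a sketch of the stated schedule; the function name, the dataset keys and the zero-based iteration index are assumptions.

```python
def learning_rate(iteration, dataset="voc", base_lr=0.001):
    """Step learning rate schedule for PASCAL VOC 2007 ("voc") or COCO ("coco").

    iteration is zero-based. The rate is base_lr for the first stage and
    base_lr / 10 for the second stage.
    """
    stages = {"voc": (50_000, 20_000), "coco": (350_000, 140_000)}
    first, second = stages[dataset]
    if not 0 <= iteration < first + second:
        raise ValueError("iteration is outside the training schedule")
    return base_lr if iteration < first else base_lr / 10.0
```

Under this schedule, training runs for 70k iterations in total on PASCAL VOC 2007 and 490k on COCO.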

4 Experiments

We evaluate our model on the PASCAL VOC 2007 [24] and COCO [10] detection benchmarks and follow the standard data splits. Average precision (AP) is reported following the literature. An image scale of 600 pixels is applied for both training and testing [1, 2]. Note that, for fair comparison, we only modify the loss function and do not include any additional parameters in any of our experiments; the only exception is the model with sigmoid output, which contains fewer parameters because its cls output is reduced from two scores to one.

PASCAL VOC 2007. PASCAL VOC [24] is a classical dataset for computer vision tasks, e.g., classification, detection and segmentation. We use it to evaluate our model in the following experiments. It contains 20 object categories for detection, with 2.47 objects per image on average. We use the trainval split for training and the test split for evaluation, which consist of 5,011 and 4,952 images, respectively. Average precision (AP) is reported with the IOU threshold set to 0.5.

COCO 2014. As a more complicated dataset, COCO [10] is a challenging object detection benchmark and is widely used for evaluating detection models. It contains 80 object categories for detection, with 7.58 objects per image on average. We use COCO 2014 in our experiments, which contains 82,783 images for training, 40,504 for validation and 40,775 for testing. Because the ground truth of the test split is unavailable, we follow the literature [2, 10] and re-split the dataset into train+valminusminival and minival. For testing, COCO employs a stricter metric, where average precision (AP) is computed at different IOU thresholds, i.e., $\left[ 0.5:0.95\right] $, and their average is reported. Besides, the performance for different object scales, i.e., small/medium/large, is also reported.
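To make the two metrics concrete, the sketch below computes the IOU of two boxes and averages per-threshold AP values in the COCO style. The box format `(x1, y1, x2, y2)` with continuous coordinates and the helper names are assumptions; the step of 0.05 between thresholds is the standard COCO convention, and the per-threshold AP values themselves would come from a full evaluation that is not shown here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IOU thresholds 0.50, 0.55, ..., 0.95 used by the COCO metric.
COCO_IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]


def coco_style_ap(ap_per_threshold):
    """Average the AP values obtained at each IOU threshold."""
    if len(ap_per_threshold) != len(COCO_IOU_THRESHOLDS):
        raise ValueError("expected one AP value per IOU threshold")
    return sum(ap_per_threshold) / len(ap_per_threshold)
```

A detection that overlaps its ground truth with an IOU of 0.6 counts as correct under the PASCAL VOC threshold of 0.5 but is rejected at the seven COCO thresholds from 0.65 upward, which is why the COCO metric is stricter.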

Table 1. Parameter evaluation. Detection average precision (%). All models use Faster R-CNN on VGG16. For each column, we only change the corresponding parameter and keep the others at their defaults. Missing values mean that the model failed in those settings.

4.1 Parameters Evaluation

We evaluate the hyper-parameters in Table 1. FL-softmax and FL-sigmoid stand for Faster R-CNN with a focal loss equipped RPN whose output uses softmax and sigmoid, respectively, as introduced in Sect. 3.3. As the table shows, FL-softmax always achieves higher performance than FL-sigmoid, while the latter is much more stable under different parameter settings. We assume that the saturation of the sigmoid function makes the model less sensitive to the hyper-parameters but also stalls the optimization process, which results in inferior performance. For the several failed settings of FL-softmax, the large scale of the focal loss computed on all anchors may be the cause, e.g., a small exponent $\gamma \le 1$ or a large loss scale $\lambda _{fl}\ge 0.2$ could make the loss explode and further hurt the optimization process. Thus, for FL-softmax, the hyper-parameters should be chosen more carefully to keep the computed focal loss at a reasonable scale, so that the model trains correctly.

Table 2. VOC 2007 test detection average precision (%). These models use the default hyper-parameters except that $\gamma =1$ in FL-sigmoid. In baseline+FL, we combine focal loss with the original RPN. *Baseline that we trained using the public implementation.

4.2 Performance Comparison

Table 2 shows the detection results of the baseline and our models adapted with focal loss. Their performance is comparable to the baseline, implying that focal loss can be applied to RPN directly to replace the sampling mechanism, but with only a minor impact on performance.

As Table 2 shows, however, although the mAP changes only slightly (0.1% lower for FL-sigmoid and 0.3% higher for FL-softmax), the performance of individual classes changes noticeably, e.g., 2.9% lower for the ‘table’ class and 4.1% higher for the ‘cat’ class in FL-softmax, which may indicate that focal loss is complementary to the standard cross entropy loss. Inspired by this, we simply add focal loss to the original RPN, denoted as baseline+FL in Table 2, which obtains the same performance as FL-softmax and thus also a minor improvement over the baseline. Specifically, in baseline+FL, focal loss is computed on all anchors as before while cross entropy loss is computed on the sampled anchors, and these two losses are directly combined by averaging. Figure 3 displays some examples from PASCAL VOC 2007 detected by the baseline+FL model, where we obtain satisfactory results over a wide range of scales and aspect ratios.

4.3 Training Process in RPN

To further analyze the influence of focal loss on RPN, we plot the cls and reg losses during training in Fig. 2. In the cls loss curves, the two focal loss equipped RPNs converge much faster and more stably than the baseline. This benefit comes from the intrinsic ability of focal loss to train with many more anchors. For the reg loss curves, however, these two models perform worse than the baseline: their losses are much less stable and larger in scale. This may be the reason why focal loss cannot boost RPN (and Faster R-CNN) as greatly as it boosts the one-stage detection model [3]; e.g., even after we obtain satisfactory scores for all the anchors, these anchors cannot be refined well enough to produce satisfactory proposals for R-CNN, which may affect the performance of the whole detection system. This may imply that the training signals produced by focal loss conflict in some respects with those from bounding box regression.

Besides, it is worth noting that RetinaNet [3], the network that first applied focal loss to detection, decouples the cls and reg tasks into two sub-networks and thus avoids this problem of conflicting signals. In this work, we follow the original design of RPN, where these two tasks share the same network except for the task-specific layers. Thus, decoupling the cls and reg tasks as in RetinaNet may further improve the performance of our focal loss equipped RPN. Other ways to make focal loss more compatible with bounding box regression can also be considered, and this will be our future work.

Table 3. COCO 2014 minival object detection average precision (%). The legend is the same as in Table 2. *Baseline that we trained using the public implementation.

4.4 More Results on COCO

We also conduct experiments with the focal loss equipped RPN on the COCO 2014 dataset [10], where we use the train+valminusminival and minival splits following [2, 10]. As Table 3 shows, FL-softmax performs comparably to the baseline, while baseline+FL is superior in all metrics. Regarding the different behavior of baseline+FL on the two datasets, we assume that the difference in dataset statistics matters: COCO contains many more objects per image than PASCAL VOC 2007 (7.58 vs 2.47 on average), which may result in differences in the training process, i.e., the anchors used for computing focal loss in COCO are not as imbalanced as in PASCAL VOC 2007. That is, on datasets with dense objects, such as COCO, focal loss combined with the standard cross entropy loss may work better than either of them alone.

On the other hand, the original implementation [2] reports that the performance on COCO could continue to improve with more training iterations, e.g., 900k/1190k; thus the reason why baseline+FL performs better than the baseline and FL-softmax may be its fast convergence. Whether the statistical difference or the convergence behavior contributes to the performance difference remains to be explored.

We note that the training processes also display the same trends as in Sect. 4.3. These experimental results show that focal loss is also adaptable to more complicated datasets.

5 Conclusion

In this work, we investigate how to adapt focal loss to train RPN without applying a sampling strategy. By down-weighting the losses of the vast number of easy samples, focal loss can intrinsically handle the class imbalance problem and prevent these losses from overwhelming the detector, which makes it possible to include many more samples for training. Thus, RPN can take all anchors into account for training by replacing the standard cross entropy loss with focal loss or by simply combining the two. As the experiments conducted on PASCAL VOC 2007 and COCO show, it is feasible to train RPN without specially designed sampling. We also discuss the compatibility between focal loss and bounding box regression in RPN, and improving it is left as future work.

Notes

1. https://github.com/endernewton/tf-faster-rcnn

References

1. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
2. Chen, X., Gupta, A.: An implementation of faster R-CNN with study for region sampling. arXiv:1702.02138 (2017)
3. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017)
4. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 346–361. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10578-9_23
5. Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: NIPS (2016)
6. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
7. Girshick, R.: Fast R-CNN. In: ICCV (2015)
8. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
9. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
10. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
11. Liu, W., et al.: SSD: single shot MultiBox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
12. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR (2016)
13. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: CVPR (2017)
14. Shrivastava, A., Gupta, A., Girshick, R.: Training region based object detectors with online hard example mining. In: CVPR (2016)
15. Uijlings, J.R., Van de Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. IJCV 104, 154–171 (2013)
16. Alexe, B., Deselaers, T., Ferrari, V.: Measuring the objectness of image windows. IEEE TPAMI 34(11), 2189–2202 (2012)
17. Zitnick, C.L., Dollár, P.: Edge boxes: locating object proposals from edges. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 391–405. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_26
18. Szegedy, C., Reed, S., Erhan, D., Anguelov, D.: Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441v2 (2014)
19. Erhan, D., Szegedy, C., Toshev, A., Anguelov, D.: Scalable object detection using deep neural networks. In: CVPR (2014)
20. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53
21. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
23. Sung, K.-K., Poggio, T.: Learning and example selection for object and pattern detection. In: MIT A.I. Memo No. 1521 (1994)
24. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. IJCV 88(2), 303–338 (2010)
25. Singh, B., Davis, L.S.: An analysis of scale invariance in object detection - SNIP. In: CVPR (2018)
26. Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: CVPR (2018)

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61532018, in part by the Lenovo Outstanding Young Scientists Program, in part by the National Program for Special Support of Eminent Professionals and the National Program for Support of Top-notch Young Professionals, and in part by the National Postdoctoral Program for Innovative Talents under Grant BX201700255.

Author information

Authors and Affiliations

Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Chengpeng Chen, Xinhang Song & Shuqiang Jiang
University of Chinese Academy of Sciences, Beijing, China
Chengpeng Chen, Xinhang Song & Shuqiang Jiang

Corresponding author

Correspondence to Shuqiang Jiang.

Editor information

Editors and Affiliations

Sun Yat-sen University, Guangzhou, China
Jian-Huang Lai
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Cheng-Lin Liu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Xilin Chen
Tsinghua University, Beijing, China
Jie Zhou
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Tieniu Tan
Xi’an Jiaotong University, Xi’an, China
Nanning Zheng
Peking University, Beijing, China
Hongbin Zha

Cite this paper

Chen, C., Song, X., Jiang, S. (2018). Focal Loss for Region Proposal Network. In: Lai, JH., et al. Pattern Recognition and Computer Vision. PRCV 2018. Lecture Notes in Computer Science, vol 11257. Springer, Cham. https://doi.org/10.1007/978-3-030-03335-4_32

DOI: https://doi.org/10.1007/978-3-030-03335-4_32
Published: 02 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03334-7
Online ISBN: 978-3-030-03335-4
eBook Packages: Computer Science, Computer Science (R0)
