1 Introduction

The field of object detection has changed drastically over the past few years. We have moved from manually designed features [13, 22] to learned ConvNet features [29, 37, 44, 68]; from the original sliding window approaches [22, 77] to region proposals [28, 29, 32, 63, 78]; and from pipeline-based frameworks such as Region-based CNN (R-CNN) [29] to more end-to-end learning frameworks such as Fast [28] and Faster R-CNN [63]. Performance has continued to soar, and things have never looked better. There seems to be a growing consensus – powerful representations learned by ConvNets are well suited for this task, and designing and learning deeper networks leads to better performance.

Most recent gains in the field have come from the bottom-up, feedforward framework of ConvNets. In the human visual system, on the other hand, feedback connections significantly outnumber feedforward connections. In fact, many behavioral studies have shown the importance of context and top-down information for the task of object detection. This raises a few important questions – Are we on the right path as we try to develop deeper and deeper, but only feedforward, networks? Is there a way we can bridge the gap between empirical results and theory when it comes to incorporating top-down information, feedback and/or contextual reasoning in object detection?

This paper investigates how we can break the feedforward mold in current detection pipelines and incorporate context, feedback and top-down information. Current detection frameworks have two components: the first component generates region proposals and the second classifies them as an object category or background. These region proposals seem to be beneficial because (a) they reduce the search space; and (b) they reduce false positives by focusing the ‘attention’ on the right areas. In fact, this is in line with psychological experiments that support the idea of priming (although note that while region proposals mostly use bottom-up segmentation [3, 32], top-down context provides the priming in humans [53, 75, 79]). So, as a first attempt, we propose to use top-down information in generating region proposals. Specifically, we add segmentation as a complementary task and use it to provide top-down information to guide region proposal generation and object detection. The intuition is that semantic segmentation captures contextual relationships between objects (e.g., support, likelihood, size etc. [6]), and will essentially guide the region proposal module to focus attention on the right areas and learn detectors from them.

But contextual priming using a top-down attention mechanism is only part of the story. In the case of humans, top-down information provides feedback to the whole visual pathway (as early as V1 [41, 43]). Therefore, we further explore providing top-down feedback to the entire network in order to modulate feature extraction in all layers. This is accomplished by providing the semantic segmentation output as input to different parts of the network and training another stage of our model. The hypothesis is that equipping the network with this top-down semantic feedback would guide the visual attention of feature extractors to the regions relevant for the task at hand.

To summarize, we propose to revisit the architecture of a current state-of-the-art detector (Faster R-CNN [63]) to incorporate top-down information, feedback and contextual information. Our new architecture includes:

  • Semantic Segmentation Network: We augment Faster R-CNN with a semantic segmentation network. We believe this segmentation can be used to provide top-down feedback to Faster R-CNN (as discussed below).

  • Contextual Priming via Semantic Segmentation: In Faster R-CNN, both region proposal and object detection modules are feedforward. We propose to use semantic segmentation to provide top-down feedback to these modules. This is analogous to contextual priming; in this case top-down semantic feedback helps propose better regions and learn better detectors.

  • Iterative Top-Down Feedback: We also propose to use semantic segmentation to provide top-down feedback to low-level filters, so that they become better suited for the detection problem. In particular, we use segmentation as an additional input to lower layers of a second round of Faster R-CNN.

2 Related Work

Object detection was once dominated by the sliding window search paradigm [22, 77]. Soon after the resurgence of ConvNets for image classification [15, 44, 47], there were attempts at using this sliding window machinery with ConvNets [19, 66, 70]; but a key limitation was the computational complexity of brute-force search.

As a consequence, there was a major paradigm shift in detection that completely bypassed the exhaustive search in favor of region-based methods and object proposals [8, 13, 18, 32, 76, 84]. By reducing the search space, it allowed us to use sophisticated (both manually designed [12, 23, 78] and learned ConvNet [5, 29, 35, 36, 38, 52, 63]) features. Moreover, this also helped focus the attention of detectors on regions well supported by perceptual structures in the image. However, recently, Faster R-CNN [63] showed that even these region proposals can be generated using ConvNet features. It removed segmentation from the proposal pipeline by training a small network on top of ConvNet features that proposes a few object candidates. This raises an important question: Do ConvNet features already capture the structure that was earlier given by segmentation, or does segmentation provide complementary information?

To answer this, we study the impact of using semantic segmentation in the region proposal and object detection modules of Faster R-CNN [63]. In fact, there has been a lot of interest in using segmentation in tandem with detection [10, 12, 17, 23]; e.g., Fidler et al. [23] proposed to use segmentation proposals as additional features for DPM detection hypothesis. In contrast, we propose to use semantic segmentation to guide/prime the region proposal generation itself. There is ample evidence of the importance of similar top-down contextual priming in the human visual system [14, 53], and its utility in reducing areas to focus our attention on for recognizing objects [75, 79].

This prevalence and success of region proposals is only part of the story. Another key ingredient is the powerful ConvNet features [37, 44, 68]. ConvNets are multi-layered hierarchical feature extractors, inspired by visual pathways in humans [21, 43]. But so far, our focus has been on designing deeper [37, 68] feedforward architectures, even though there is broad agreement on the importance of feedback connections [11, 27, 41] and the limitations of purely feedforward recognition [46, 80] in human visual systems. Inspired by this, we investigate how we can start incorporating top-down feedback in our current object detection architectures. There have been earlier attempts at exploiting feedback mechanisms; some well known examples are auto-context [74] and inference machines [64]. These iteratively use predictions from a previous iteration to provide contextual features for the next round of processing; however, they do not trivially extend to ConvNet architectures. Closest to our goal are the contemporary works on using feedback to learn selective attention [55, 69] and using top-down iterative feedback to improve the task at hand [7, 25, 48]. In this work, we additionally explore using top-down feedback from one task to another.

The discussion on using global top-down feedback to contextually prime object recognition is incomplete without relating it to ‘context’ in general, which has a long history in cognitive neuroscience [6, 39, 40, 53, 59, 60, 75, 79] and computer vision [16, 24, 58, 62, 71–73, 81]. It is widely accepted that human visual inference of objects is heavily influenced by ‘context’, be it contextual relationships [6, 39], priming for focusing attention [53, 75, 79] or the importance of scene context [14, 40, 59, 60]. These ideas have inspired a lot of computer vision research (see [16, 24] for surveys). However, these approaches seldom lead to strong empirical gains. Moreover, they are mostly confined to weaker visual features (e.g., [13]) and have not been explored much in ConvNet-based object detectors.

For region-based ConvNet object detectors, simple contextual features are slowly becoming popular; e.g., computing local context features by expanding the region [26, 56, 57, 83], using other objects (e.g., people) as context [33] and using other regions [30]. In comparison, the use of context has been much more popular for semantic segmentation. E.g., CRFs are commonly used to incorporate context and post-process segmentation outputs [9, 65, 82] or to jointly reason about regions, segmentation and detection [45, 83]. More recently, RNNs have also been employed to either integrate intuitions from CRFs [49, 61, 82] in end-to-end learning systems or to capture context outside the region [5]. But empirically, at least for detection, such uses of context have mostly given feeble gains.

3 Preliminaries: Faster R-CNN

We first describe the two core modules of the Faster R-CNN [63] framework (Fig. 1). The first module takes an image as input and proposes rectangular regions of interest (RoIs). The second module is the Fast R-CNN [28] (FRCN) detector that classifies these proposed regions. In this paper, both modules use the VGG16 [68] network, which has 13 convolutional (conv) and 2 fully connected (fc) layers. Both modules share all conv layers and branch out at conv5_3. Given an arbitrary sized image, the last conv feature map (conv5_3) is used as input to both the modules as described below.

Region Proposal Network (RPN). The region proposal module (Fig. 1, left) is a small fully convolutional network that operates on the last feature map and outputs a set of rectangular object proposals, each with a score. RPN is composed of a conv layer and 2 sibling fc layers. The conv layer operates on the input feature map to produce a D-dim. output at every spatial location, which is then fed to two fc layers – classification (cls) and box-regression (breg). At each spatial location, RPN considers k candidate boxes (anchors) and learns to classify them as either foreground or background based on their IOU overlap with the ground-truth boxes. For foreground boxes, the breg layer learns to regress to the closest ground-truth box. A typical setting is \({\texttt {D}}=512\) and \({k }=9\) (3 scales, 3 aspect-ratios) (see [63] for details).
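To make the anchor mechanism concrete, the following is a minimal sketch (not the released implementation; the feature stride of 16 and the exact scale values are assumptions based on the common VGG16 setting in [63]) of generating k = 9 anchors at every location of the last conv feature map:

```python
# Illustrative anchor generation: k = 9 anchors (3 scales x 3 aspect-ratios)
# per spatial location; scales and stride are assumed, not taken from the paper.
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors as (x1, y1, x2, y2) boxes."""
    base = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(1.0 / r)   # width so that area = s^2 and h/w = r
            h = s * np.sqrt(r)         # height
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                 # (k, 4) centered anchors

    # centers of every spatial location, mapped back to image coordinates
    shift_x = (np.arange(feat_w) + 0.5) * stride
    shift_y = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)

    return (shifts + base).reshape(-1, 4)                  # (feat_h * feat_w * k, 4)

anchors = generate_anchors(feat_h=38, feat_w=50)           # e.g. a ~600x800 input
print(anchors.shape)                                       # (17100, 4)
```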

Fig. 1. Faster R-CNN. (left) Overview of Region Proposal Network (RPN) and RoI classification and box regression. (right) Shorthand diagram of Faster R-CNN. (Color figure online)

Using RPN regions in FRCN. For training the Fast R-CNN (FRCN) module, a mini-batch is constructed using the regions from RPN. Each region in the mini-batch is projected onto the last conv feature map and a fixed-length feature vector is extracted using RoI-pooling [28, 38]. Each feature is then fed to two fc layers, which finally give two outputs: (1) a probability distribution over object classes and background; and (2) regressed coordinates for box re-localization. An illustration is shown in Fig. 1 (left).
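The following is a simplified sketch of the RoI-pooling step (an illustrative NumPy version, assuming a single feature map, a fixed stride of 16 and no interpolation; the actual layer of [28, 38] is a learned-end-to-end GPU op):

```python
# Illustrative RoI pooling: project a region onto the conv feature map and
# max-pool it into a fixed 7x7 grid, giving a fixed-length feature per region.
import numpy as np

def roi_pool(feat, roi, stride=16, out_size=7):
    """feat: (C, H, W) conv feature map; roi: (x1, y1, x2, y2) in image coords."""
    x1, y1, x2, y2 = [int(round(c / stride)) for c in roi]  # project to feature map
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)                # avoid empty regions
    region = feat[:, y1:y2, x1:x2]
    C, h, w = region.shape
    out = np.zeros((C, out_size, out_size), dtype=feat.dtype)
    for i in range(out_size):                                # adaptive max-pool grid
        for j in range(out_size):
            ys = slice(i * h // out_size,
                       max((i + 1) * h // out_size, i * h // out_size + 1))
            xs = slice(j * w // out_size,
                       max((j + 1) * w // out_size, j * w // out_size + 1))
            out[:, i, j] = region[:, ys, xs].max(axis=(1, 2))
    return out   # (C, 7, 7), flattened before fc6

feat = np.random.randn(512, 38, 50).astype(np.float32)
pooled = roi_pool(feat, roi=(100, 120, 400, 360))
print(pooled.shape)   # (512, 7, 7)
```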

Training Faster R-CNN. Both RPN and FRCN modules of Faster R-CNN are trained by minimizing the multi-task loss (for classification and box-regression) from [28, 63] using mini-batch SGD. To construct a mini-batch for RPN, 256 anchors are randomly sampled with a 1:1 foreground-to-background ratio; for FRCN, 128 proposals are sampled with a 1:3 ratio. We train both modules jointly using ‘approximate joint training’. For more details, refer to [28, 29, 63, 67].
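The sampling described above can be sketched as follows (illustrative NumPy code with assumed label arrays, not the released implementation):

```python
# Illustrative mini-batch sampling: 256 anchors at ~1:1 fg/bg for RPN and
# 128 proposals at 1:3 fg/bg for FRCN. Label arrays below are synthetic.
import numpy as np

def sample_minibatch(labels, batch_size, fg_fraction):
    """labels: 1 = foreground, 0 = background, -1 = ignored. Returns sampled indices."""
    fg_inds = np.where(labels == 1)[0]
    bg_inds = np.where(labels == 0)[0]
    num_fg = min(int(batch_size * fg_fraction), len(fg_inds))
    num_bg = min(batch_size - num_fg, len(bg_inds))
    fg = np.random.choice(fg_inds, num_fg, replace=False)
    bg = np.random.choice(bg_inds, num_bg, replace=False)
    return np.concatenate([fg, bg])

anchor_labels = np.random.choice([-1, 0, 1], size=17100, p=[0.90, 0.08, 0.02])
rpn_batch = sample_minibatch(anchor_labels, batch_size=256, fg_fraction=0.5)   # ~1:1
roi_labels = np.random.choice([0, 1], size=2000, p=[0.9, 0.1])
frcn_batch = sample_minibatch(roi_labels, batch_size=128, fg_fraction=0.25)    # 1:3
print(len(rpn_batch), len(frcn_batch))
```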

Given an image during training, a forward pass through all the conv layers produces the conv5_3 feature map. RPN operates on this feature map to propose two sets of regions, one each for training RPN and FRCN. Independent forward-backward passes are computed for RPN and FRCN using their region sets; gradients are accumulated at conv5_3 and back-propagated through the conv layers.

Why Faster R-CNN? Apart from being the current state-of-the-art object detector, Faster R-CNN is also the first framework that learns where to guide the ‘attention’ of an object detector along with the detector itself. This end-to-end learning of proposal generation and object detection provides a principled testbed for studying the proposed top-down contextual feedback mechanisms.

In the following sections, we first describe how we add a segmentation module to Faster R-CNN (Sect. 4.1) and then present how we use segmentation for top-down contextual priming (Sect. 4.2) and iterative feedback (Sect. 4.3).

4 Our Approach

We propose to use semantic segmentation as a top-down feedback signal to the RPN and FRCN modules in Faster R-CNN, and iteratively to the entire network. We argue that a raw semantic segmentation output is a compact signal that captures the desired contextual information such as relationships between objects (Sect. 2) along with global structures in the image, and hence is a good representation for top-down feedback.

4.1 Augmenting Faster R-CNN with Segmentation

The first step is to augment the Faster R-CNN framework with an additional segmentation module. This module should ideally: (1) be fast, so that we do not give up the speed advantages of [28, 63]; (2) closely follow the network used by Faster R-CNN (VGG16 in this paper), for easy integration; and (3) use minimal (preferably no) post-processing, so that we can train it jointly with Faster R-CNN. Out of several possible architectures [4, 9, 51, 52, 82], we choose the ParseNet architecture [51] because of its simplicity.

Fig. 2. (a) Overview of ParseNet. (b) Shorthand diagram of our multi-task setup (Faster R-CNN + Segmentation). Refer to Sects. 4.1 and 5.2 for details.

ParseNet [51] is a fully convolutional network [52] for segmentation. It is fast because it uses the filter rarefication technique (à trous algorithm) from [9]. Its architecture is similar to VGG16. Moreover, it uses no post-processing; instead, it adds an average pooling layer to incorporate global context, which is shown to have benefits similar to using CRFs [9, 49].

Architecture Details. An overview is shown in Fig. 2(a). The key difference from standard VGG16 is that the pooling after conv4_3 (pool4 \(_\text {seg}\)) does no down-sampling, as opposed to the standard pool4, which down-samples by a factor of 2. After the conv5 block, it has two 1 \(\times \) 1 conv layers with 1024 channels applied with a filter stride [9, 51]. Finally, it has a global average pooling step which, given the feature map after any layer (H \(\times \) W \(\times \) D), computes its spatial average (1 \(\times \) 1 \(\times \) D) and ‘unpools’ the features. Both the source and its average feature maps are normalized and used to predict per-pixel labels. These outputs are then fused and an 8 \(\times \) deconv layer is used to produce the final output.
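The global-context step can be sketched as follows (a minimal NumPy version with an assumed, fixed normalization scale; in ParseNet the scale is learned and the per-pixel predictions come from 1 \(\times \) 1 conv layers):

```python
# Illustrative ParseNet-style global context: spatially average the feature map,
# 'unpool' the average back to full resolution, and L2-normalize both maps before
# predicting (and later fusing) per-pixel labels. Scale value is an assumption.
import numpy as np

def l2_normalize(x, scale=10.0, eps=1e-12):
    """Channel-wise L2 normalization with a fixed scale (learned in ParseNet)."""
    norm = np.sqrt((x ** 2).sum(axis=0, keepdims=True)) + eps
    return scale * x / norm

def parsenet_context(feat):
    """feat: (D, H, W) feature map from the last conv layer."""
    D, H, W = feat.shape
    global_feat = feat.mean(axis=(1, 2), keepdims=True)        # (D, 1, 1) average pool
    unpooled = np.broadcast_to(global_feat, (D, H, W)).copy()   # 'unpool' to (D, H, W)
    return l2_normalize(feat), l2_normalize(unpooled)

feat = np.random.randn(1024, 64, 64).astype(np.float32)
local_n, global_n = parsenet_context(feat)
# Per-pixel class scores would be predicted from both maps, fused, and
# upsampled 8x by a deconv layer to give the final segmentation output.
print(local_n.shape, global_n.shape)
```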

Faster R-CNN with Segmentation – A Multi-task Setup. In the joint network (Fig. 2(b)), both the Faster R-CNN modules and the segmentation module share the first 10 conv layers (conv1_1 - conv4_3) and differ from pool4 onwards. For the segmentation module, we branch out a pool4 \(_\text {seg}\) layer with a stride of 1 and add the remaining ParseNet layers (conv5_1 to deconv) (Fig. 2). The final architecture is a multi-task setup [54], which produces both semantic segmentation and object detection outputs simultaneously.

Training Details. Now that we have a joint architecture, we can train the segmentation, RPN and detection modules by minimizing a multi-task loss. However, there are some key issues: (1) Faster R-CNN can operate on an arbitrarily sized input image, whereas ParseNet requires a fixed 500 \(\times \) 500 image; in this joint framework, our segmentation module is adapted to handle arbitrarily sized images. (2) Faster R-CNN and ParseNet are trained using very different sets of hyperparameters (e.g., learning rate schedule, batch-size etc.), and neither set of parameters is optimal for the other. So for joint training, we modify the hyperparameters of the segmentation module and shared layers. Details on these design decisions and analysis of their impact will be presented in Sect. 5.2.

This Faster R-CNN + Segmentation framework serves as the base model on top of which we add top-down contextual feedback. We will also use this multi-task model as our primary baseline (Base-MT) as it is trained using both segmentation and detection labels but does not have contextual feedback.

4.2 Contextual Priming via Segmentation

We propose to use semantic segmentation as top-down feedback to the region proposal and object detection modules of our base model. We argue that segmentation captures contextual information which will ‘prime’ the region proposal and object detection modules to propose better regions and learn better detectors.

In our base multi-task model, the Faster R-CNN modules operate on the conv feature map from the shared network. To contextually prime these modules, their input is modified to be a combination of the aforementioned conv features and the segmentation output. Both modules can now learn to guide their operations based on the semantic segmentation of an image – they can learn to ignore background regions, find smaller objects or find large occluded objects (e.g., tables) etc. Specifically, we take the raw segmentation output and append it to the conv4_3 feature. The conv5 block of filters operates on this new input (‘seg \(+\) conv4_3’) and its output is input to the individual Faster R-CNN modules. Hence, a top-down feedback signal from segmentation ‘primes’ both Faster R-CNN modules. However, because of the RoI-pooling operation, the detection module only sees the segmentation signal local to a particular region. To provide a global context to each region, we also append segmentation to the fixed-length feature vector (‘seg \(+\) pool5’) before feeding it to fc6. An overview is shown in Fig. 3(a).

This entire system (three modules with connections between them) is trained jointly. After a forward pass through the shared conv layers and the segmentation module, their outputs are used as input to both Faster R-CNN modules. A forward-backward pass is performed for both RPN and FRCN. Next, the segmentation module does a backward pass using the gradients from its loss and from the other modules. Finally, gradients are accumulated at conv4_3 from all three modules and backward pass is performed for the shared conv layers.

Fig. 3. Overview of the proposed models for top-down feedback. (a) Contextual Priming via Segmentation (Sect. 4.2) uses segmentation as a top-down feedback signal to guide the RPN and FRCN modules of Faster R-CNN. (b) Iterative Feedback (Sect. 4.3) is a 2-unit model, where Stage-1 provides top-down feedback for Stage-2 filters. (c) Joint Model (Sect. 4.4) uses (a) as the base unit in (b).

Architecture Details. Given an \(\left( {\texttt {H}}_\texttt {I}\times {\texttt {W}}_\texttt {I}\times 3\right) \) input, conv4_3 produces a \(\left( {\texttt {H}}_\texttt {c}\times {\texttt {W}}_\texttt {c}\times 512\right) \) feature map, where \(\left( {\texttt {H}}_\texttt {c},{\texttt {W}}_\texttt {c}\right) \approx \left( {\texttt {H}}_\texttt {I}/8,{\texttt {W}}_\texttt {I}/8\right) \). Using this feature map, the segmentation module produces a \(\left( {\texttt {H}}_\texttt {I} \times {\texttt {W}}_\texttt {I} \times \left( \texttt {K}+1\right) \right) \) output, which is a pixel-wise probability distribution over \(\texttt {K}+1\) classes. We ignore the background class and only use the \(\left( {\texttt {H}}_\texttt {I} \times {\texttt {W}}_\texttt {I} \times \texttt {K}\right) \) output, which we refer to as S. Now, S needs to be combined with the conv4_3 feature for the Faster R-CNN modules and with each region’s \(\left( 7\times 7\times 512\right) \)-dim. pool5 feature map for FRCN, but there are 2 issues: (1) the spatial dimensions of S do not match either, and (2) feature values from different layers are at drastically different scales [51]. To deal with the spatial dimension mismatch, we utilize the RoI/spatial-pooling layer from [28, 38]: we maxpool S using an adaptive grid to produce two outputs \({\texttt {S}}_{\texttt {c}}\) and \({\texttt {S}}_{\texttt {p}}\), which have the same spatial dimensions as conv4_3 and pool5 respectively. We then normalize and scale \({\texttt {S}}_{\texttt {c}}\) to \({\texttt {S}}_{\texttt {cN}}\) and \({\texttt {S}}_{\texttt {p}}\) to \({\texttt {S}}_{\texttt {pN}}\), such that their L2-norm [51] is of the same scale as the per-channel L2-norm of their corresponding features (conv4_3 and pool5 respectively). Next, we append \({\texttt {S}}_{\texttt {cN}}\) to conv4_3 and the resulting \(\left( {\texttt {H}}_\texttt {c}\times {\texttt {W}}_\texttt {c}\times \left( 512+\texttt {K}\right) \right) \) feature is the input for Faster R-CNN. Finally, we append \({\texttt {S}}_{\texttt {pN}}\) to each region’s pool5 and the resulting \(\left( 7\times 7\times \left( 512+\texttt {K}\right) \right) \) feature is the input for fc6 of FRCN. This network architecture is trained from a VGG16-initialized base model; the filters corresponding to the additional K channels in conv5_1 and fc6 are initialized randomly using [31, 68]. Refer to Fig. 3(a) for an overview.
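The priming data flow can be sketched as follows (illustrative NumPy code with assumed shapes; `adaptive_maxpool` stands in for the RoI/spatial-pooling layer, and `normalize_and_scale` computes the scale on the fly rather than learning it):

```python
# Illustrative priming: resize the segmentation output S with adaptive max-pooling,
# match its L2-norm to the corresponding feature map, and append it as extra channels.
import numpy as np

def adaptive_maxpool(x, out_h, out_w):
    """Max-pool a (K, H, W) input to (K, out_h, out_w) with an adaptive grid."""
    K, H, W = x.shape
    out = np.zeros((K, out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            ys = slice(i * H // out_h, max((i + 1) * H // out_h, i * H // out_h + 1))
            xs = slice(j * W // out_w, max((j + 1) * W // out_w, j * W // out_w + 1))
            out[:, i, j] = x[:, ys, xs].max(axis=(1, 2))
    return out

def normalize_and_scale(s, ref):
    """Scale s so its L2-norm is on the scale of ref's mean per-channel L2-norm."""
    s_norm = np.sqrt((s ** 2).sum()) + 1e-12
    ref_norm = np.sqrt((ref ** 2).sum(axis=(1, 2))).mean()
    return s * (ref_norm / s_norm)

K, Hi, Wi = 20, 600, 800                                   # 20 PASCAL classes (bg dropped)
S = np.random.rand(K, Hi, Wi).astype(np.float32)            # per-pixel class probabilities
conv4_3 = np.random.randn(512, Hi // 8, Wi // 8).astype(np.float32)
pool5 = np.random.randn(512, 7, 7).astype(np.float32)       # one region's RoI-pooled feature

S_cN = normalize_and_scale(adaptive_maxpool(S, Hi // 8, Wi // 8), conv4_3)
S_pN = normalize_and_scale(adaptive_maxpool(S, 7, 7), pool5)
primed_conv = np.concatenate([conv4_3, S_cN], axis=0)       # (512 + K, Hc, Wc) -> conv5 block
primed_roi = np.concatenate([pool5, S_pN], axis=0)          # (512 + K, 7, 7)   -> fc6
print(primed_conv.shape, primed_roi.shape)
```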

4.3 Iterative Feedback via Segmentation

The architecture proposed in the previous section provides top-down semantic feedback that modulates only the Faster R-CNN modules. We also propose to provide top-down information to the whole network, especially the shared conv layers, to modulate the low-level filters. The hypothesis is that this feedback will help the earlier conv layers focus on areas likely to contain objects. We again build on the Base-MT model (Sect. 4.1).

This top-down feedback is iterative in nature and will pass from one instantiation of our base model to another. To provide this top-down feedback, we take the raw segmentation output of our base model (Stage-1) and append it to the input of the conv layer to be modulated in the second model instance (Stage-2) (see Fig. 3(b)). E.g., to modulate the first conv layer of Stage-2, we append the Stage-1 segmentation signal to the input image, and use this combination as the new input to conv1_1. This feedback mechanism is trained stage-wise: the Stage-1 model (Base-MT) is trained first; and then it is frozen and only the Stage-2 model is trained. This iterative feedback is similar to [7, 48]; the key difference being that they only focus on iteratively improving the same task, whereas in this work, we also use feedback from one task to improve another.

Architecture Details. Given the pixel-wise probability output of the Stage-1 segmentation module, the background class is ignored and the remaining output (S) is used as the semantic feedback signal. Again, S needs to be resized, rescaled and/or normalized to match the spatial dimensions and the feature value scales of the inputs to the various conv layers. To append it to the input image, S is re-scaled and centered element-wise to lie in \(\left[ -127,128\right] \). This results in a new \(\left( {\texttt {H}}_{\texttt {I}} \times {\texttt {W}}_{\texttt {I}} \times (3+{\texttt {K}})\right) \) input for conv1_1. To modulate conv2_1, conv3_1 and conv4_1, we maxpool and L2-normalize S to match the spatial dimensions and the feature value scales of the pool1, pool2 and pool3 features respectively (similar to Sect. 4.2). The filters corresponding to the additional K channels in conv1_1, conv2_1, conv3_1 and conv4_1 are initialized using [31].
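A minimal sketch of preparing this feedback signal (illustrative NumPy code with assumed shapes; only the conv1_1 and conv2_1 cases are shown, and the L2 scaling step of Sect. 4.2 is omitted):

```python
# Illustrative iterative feedback: re-scale the Stage-1 segmentation S to
# [-127, 128] and append it to the image for conv1_1; max-pool it to match
# pool1/pool2/pool3 when feeding conv2_1/conv3_1/conv4_1 of Stage-2.
import numpy as np

def rescale_and_center(s):
    """Map per-pixel probabilities in [0, 1] to roughly [-127, 128]."""
    return s * 255.0 - 127.0

def downsample_maxpool(s, factor):
    """Max-pool a (K, H, W) map by an integer factor (H, W assumed divisible)."""
    K, H, W = s.shape
    return s.reshape(K, H // factor, factor, W // factor, factor).max(axis=(2, 4))

K, Hi, Wi = 20, 600, 800
S = np.random.rand(K, Hi, Wi).astype(np.float32)           # Stage-1 output (bg dropped)
image = np.random.randint(0, 256, (3, Hi, Wi)).astype(np.float32) - 127.0

# Feedback to conv1_1: append the centered segmentation to the image channels.
stage2_input = np.concatenate([image, rescale_and_center(S)], axis=0)  # (3 + K, Hi, Wi)

# Feedback to conv2_1 (its input, pool1, is 2x downsampled); conv3_1 and conv4_1
# follow the same pattern with factors 4 and 8.
S_pool1 = downsample_maxpool(S, factor=2)                   # (K, Hi/2, Wi/2)
print(stage2_input.shape, S_pool1.shape)
```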

4.4 Joint Model

So far, given our multi-task base model, we have proposed a top-down feedback for contextual priming of region proposal and object detection modules and an iterative top-down feedback mechanism to the entire architecture. Next, we put these two pieces together in a single joint framework. Our final model is a 2-unit model: each individual unit being the contextual priming model (from Sect. 4.2), and both units being connected for iterative top-down feedback (Sect. 4.3). We train this 2-unit model stage-wise (Sect. 4.3). Architecture details of the joint model follow from Sects. 4.2 and 4.3 (see Fig. 3(c)).

Through extensive evaluation, presented in the following sections, we show that: (1) individually, both contextual priming and iterative feedback models are effective and improve performance; and (2) the joint model is better than both individual models, indicating their complementary nature. We would like to highlight that our method is fairly general – both segmentation and detection modules can easily utilize newer network architectures (e.g., [4, 37]).

5 Experiments

We conduct experiments to better understand the impact of contextual priming and iterative feedback; and provide ablation analysis of various design decisions. Our implementation uses the Caffe [42] library.

5.1 Experimental Setup

For ablation studies, we use the multi-task setup from Sect. 4.1 as our baseline (Base-MT). We also compare our method to Faster R-CNN [63] and ParseNet [51] frameworks. For quantitative evaluation, we use the standard mean average precision (mAP) [20] metric for object detection and mean intersection-over-union metric (mIOU) [20, 28] for segmentation.
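For reference, a compact sketch of the mIOU metric (a standard confusion-matrix formulation with an assumed ignore label of 255; per-class IoUs are averaged over the classes present):

```python
# Illustrative mean intersection-over-union (mIoU) computation for segmentation.
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=255):
    """pred, gt: integer label maps of the same shape."""
    valid = gt != ignore_label
    conf = np.bincount(num_classes * gt[valid] + pred[valid],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf).astype(float)                      # true positives per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter      # pred + gt - intersection
    ious = inter / np.maximum(union, 1)
    return ious[union > 0].mean()                            # average over present classes

gt = np.random.randint(0, 21, (600, 800))                    # 20 classes + background
pred = np.random.randint(0, 21, (600, 800))
print(mean_iou(pred, gt, num_classes=21))
```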

Datasets. All models in this section are trained on the PASCAL VOC12 [20] segmentation set (12S), augmented with the extra annotations (A) from [34] as is standard practice. Results are analyzed on VOC12 segmentation val set. For analysis, we chose the segmentation set, and not detection, because all images have both segmentation and bounding-box annotations; this helps us isolate the effects of using segmentation as top-down semantic feedback without worrying about missing segmentation labels in the standard detection split. Results on the standard splits will be presented in Sect. 6.

Table 1. Ablation analysis of modifying ParseNet training methodology (Sect. 5.2).

5.2 Base Model – Augmenting Faster R-CNN with Segmentation

Faster R-CNN and ParseNet both use mini-batch SGD for training, however, they follow different training methodologies. We first describe the implementation details and design decisions adopted to augment the segmentation module to Faster R-CNN and report baseline performances.

ParseNet Optimization. ParseNet is trained for 20k SGD iterations using an effective mini-batch of 8 images, an initial learning rate (LR) of \(10^{-8}\) and a polynomial LR decay policy. Compare this to Faster R-CNN, which is trained for 70k SGD iterations with a mini-batch size of 2, \(10^{-3}\) initial LR and a step LR decay policy (step at 50k). Since we are augmenting Faster R-CNN, we try to adapt ParseNet’s optimization. On the 12S val set, [51] reports \(69.6\,\%\) (we achieved \(68.2\,\%\) using the released code, Table 1(1–2)). We will refer to the latter as ParseNet throughout. Similar to [52], ParseNet does not normalize the Softmax loss by the number of valid pixels. But to train with Faster R-CNN in a multi-task setup, all losses need to have similar magnitude; so, we normalize the loss of ParseNet and modify the LR accordingly. Next, we change the LR decay policy from polynomial to step (step at 12.5k) to match that of Faster R-CNN. These changes result in similar performance (\(+0.3\) points, Table 1(2–3)). We then reduce the batch size to 2 and adjust the LR appropriately (Table 1(4)). To keep the base LR of Faster R-CNN and ParseNet the same, we change it to \(10^{-3}\) and modify the LR associated with each ParseNet layer to 0.25, thus keeping the same effective LR for ParseNet (Table 1(4–5)).
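The loss-normalization change can be sketched as follows (a small sketch with illustrative values, not the actual Caffe layer): the summed per-pixel Softmax loss is divided by the number of valid (non-ignored) pixels so that its magnitude is comparable to the detection losses, with the LR adjusted accordingly.

```python
# Illustrative normalization of the segmentation loss by the number of valid pixels.
import numpy as np

def normalized_seg_loss(per_pixel_loss, valid_mask):
    """per_pixel_loss: (H, W) cross-entropy values; valid_mask: (H, W) booleans."""
    num_valid = max(valid_mask.sum(), 1)
    return per_pixel_loss[valid_mask].sum() / num_valid

H, W = 600, 800
loss_map = np.random.rand(H, W).astype(np.float32)
valid = np.random.rand(H, W) > 0.05            # ~5% ignored pixels (e.g. void label)
seg_loss = normalized_seg_loss(loss_map, valid)
# Without normalization the loss (and its gradients) is roughly num_valid times
# larger, which is broadly why the original ParseNet LR of 1e-8 is far smaller
# than the 1e-3 base LR used after normalization.
print(seg_loss)
```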

Training Data. ParseNet re-scales the input images and their segmentation labels to a fixed size (500 \(\times \) 500), thus ignoring the aspect-ratio. On the other hand, Faster R-CNN maintains the aspect-ratio and re-scales the input images such that their shorter side is 600 pixels (and the max dim. is capped at 1000). We found that ignoring the aspect-ratio drops Faster R-CNN performance, while maintaining it drops the performance of ParseNet (\(-1.8\) points, Table 1(5–6)). Because our main task is detection, we opted to use the Faster R-CNN strategy, and treat the new ParseNet (ParseNet\(^*\)) as the baseline for our base model.
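The Faster R-CNN resizing strategy adopted here can be written as a small helper (a sketch; the 600/1000 values follow [63]):

```python
# Illustrative aspect-ratio-preserving resize: shorter side -> 600 px,
# longer side capped at 1000 px.
def detection_scale(height, width, target=600, max_size=1000):
    scale = target / min(height, width)
    if round(max(height, width) * scale) > max_size:
        scale = max_size / max(height, width)
    return round(height * scale), round(width * scale)

print(detection_scale(375, 500))    # -> (600, 800)
print(detection_scale(333, 1000))   # longer side already at the 1000 px cap
```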

Base Model Optimization. Following the changes mentioned above, our base model uses these standardized parameters: batch size of 2, \(10^{-3}\) base LR, step decay policy (step at 50k), LR of 0.25 for the segmentation and shared conv layers, and 80k SGD iterations. This model serves as our multi-task baseline (Base-MT).

Baselines. For comparison, we re-train Fast [28] and Faster R-CNN [63] on the VOC 12S+A training set. Results of the Base-MT model for detection and segmentation are reported in Tables 2 and 3 respectively. Performance increases by 0.3 mAP on detection and drops by 0.1 mIOU on segmentation. This will serve as our primary baseline.

Table 2. Detection results on VOC 2012 segmentation val set. All methods use VOC12S+A training set (Sect. 5.1). Legend: S: uses segmentation labels (Sect. 4.1), P: contextual priming (Sect. 4.2), F: iterative feedback (Sect. 4.3)

5.3 Contextual Priming

We evaluate the effects of using segmentation as top-down semantic feedback to the region proposal generation and object detection modules. We follow the same optimization hyperparameters as the Base-MT model, and report the results in Tables 2 and 3. Table 2 shows that providing top-down feedback via priming to the Faster R-CNN modules improves its detection performance by \(\mathbf{1.4 }\) points over the Base-MT model and \(\mathbf{1.7 }\) points over Faster R-CNN. Results in Table 3 show that performance of segmentation drops slightly when it is used for priming.

Table 3. Segmentation results on VOC 2012 segmentation val set. All methods use VOC12S+A training set (Sect. 5.1). Legend: S: uses segmentation labels, P: contextual priming, F: iterative feedback
Table 4. Ablation analysis of Contextual Priming and Iterative Feedback on the VOC 12S val set. All methods use the VOC 12S+A train set for training

Design Evaluation. In Table 4(a), we report the impact of providing the segmentation signal to different modules. We see that just priming conv5_1 gives a 1 point boost over Base-MT, and adding the segmentation signal to each individual region (‘seg \(+\) pool5’ to fc6) gives another 0.4 points boost. It is interesting that the segmentation performance is not affected when priming conv5_1, but it drops by 0.5 mIOU when we prime each region. Our hypothesis is that gradients accumulated from all regions in the mini-batch start overpowering the gradients from segmentation. To deal with this, methods like [54] can be used in the future.

5.4 Iterative Feedback

Next, we study the impact of giving iterative top-down semantic feedback to the entire network. In this 2-unit setup, the first unit (Stage-1) is a trained Base-MT model and the second unit (Stage-2) is a Stage-1 initialized Base-MT model. During inference, we have the option of using the outputs from both units or just the Stage-2 unit. Given that segmentation is used as feedback, it is supposed to self-improve across units; therefore, we use the Stage-2 output as our final output (similar to [7, 48]). For detection, we combine the outputs from both units, because the Stage-2 unit is modulated by segmentation while the first unit is not, so the two might focus on different regions.
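The text above only states that the two units’ detections are combined; one plausible way to do this (an assumption for illustration, not a confirmed detail of this work) is to merge the scored boxes per class and suppress duplicates with non-maximum suppression (NMS):

```python
# Illustrative (assumed) combination of Stage-1 and Stage-2 detections for one
# class: pool the scored boxes from both units and apply greedy NMS.
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """boxes: (N, 4) as (x1, y1, x2, y2); returns indices of kept boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(xx2 - xx1, 0) * np.maximum(yy2 - yy1, 0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

stage1_boxes = np.array([[10, 10, 110, 110], [12, 8, 108, 112]], dtype=float)
stage2_boxes = np.array([[11, 9, 109, 111], [200, 50, 300, 150]], dtype=float)
boxes = np.vstack([stage1_boxes, stage2_boxes])
scores = np.array([0.8, 0.6, 0.9, 0.7])
print(nms(boxes, scores))   # the three overlapping boxes collapse to one detection
```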

This iterative feedback improves the segmentation performance (Table 3) by \(\mathbf{3.7 }\) points over Base-MT (\(\mathbf{3.5 }\) points over ParseNet\(^*\)). For detection, it improves over the Base-MT model by \(\mathbf{1.7 }\) points (\(\mathbf 2 \) points over Faster R-CNN) (Table 2).

Design Evaluation. We study the impact of: (1) varying the degree of feedback to the Stage-2 unit, and (2) different Stage-2 initializations. In Table 4(b), we see that when initializing the Stage-2 unit with an ImageNet trained network, varying iterative feedback does not have much impact; however, when initializing with a Stage-1 model, providing more feedback leads to better performance. Specifically, iterative feedback to all shared conv layers improves both detection and segmentation by 1.7 mAP and 3.7 mIOU respectively, as opposed to feedback to just conv1_1 (as in [7, 48]) which results in lower gains (Table 4(b)). Our hypothesis is that iterative feedback to a Stage-1 initialized unit allows the network to correct its mistakes and/or refine its predictions; therefore, providing more feedback leads to better performance.

5.5 Joint Model

Finally, we evaluate our joint 2-unit model, where each unit is a model with contextual priming, and both units are connected via segmentation feedback. In this setup, a trained contextual priming model is used as the Stage-1 unit as well as the initialization for the Stage-2 unit. We remove the dropout layers from the Stage-2 unit. Inference follows the procedure described in Sect. 5.4.

As shown in Table 2, for detection, the joint model achieves \(\mathbf{77.8 }\,\%\) mAP (\(\mathbf{+2.2 }\) points over Base-MT and \(\mathbf{+2.5 }\) points over Faster R-CNN), which is better than both priming only and feedback only models. This suggests that both forms of top-down feedback are complementary for object detection. The segmentation performance (Table 3) is similar to the feedback only model, which is expected since in both cases, the segmentation module receives similar feedback.

6 Results

We now report results on the PASCAL VOC and MS COCO [50] datasets. We also evaluate the region proposal generation on the proxy metric of average recall.

Experimental Setup. When training on the VOC datasets with extra data (Tables 5, 6 and 7), we use 100k SGD iterations (other hyperparameters follow Sect. 5); for MS COCO, we use 490k SGD iterations with an initial LR of \(10^{-3}\) and a decay step size of 200k, owing to a larger epoch size.

VOC07 and VOC12 Results. Table 5 shows that on VOC07, our joint priming and feedback model improves the detection mAP by \(\mathbf {1.7}\) points over Base-MT and \(\mathbf {3.2}\) points over Faster R-CNN. Similarly, on VOC12 (Table 6), priming and feedback lead to a \(\mathbf {1.5}\) points boost over Base-MT (\(\mathbf {2.2}\) over Faster R-CNN). For segmentation on VOC12 (Table 7), we see a huge 5 point boost in mIOU over Base-MT. We would like to highlight that both Base-MT and our joint model use exactly the same annotations and hyperparameters; therefore, the performance boosts are because of contextual priming and iterative feedback in our model.

Table 5. Detection results on VOC 2007 detection test set. All methods are trained on union of VOC07 trainval and VOC12 trainval
Table 6. Detection results on VOC 2012 detection test set. All methods are trained on union of VOC07 trainval, VOC07 test and VOC12 trainval
Table 7. Segmentation results on VOC 2012 segmentation test set. All methods are trained on union of VOC07 trainval, VOC07 test and VOC12 trainval

Recall-to-IOU. Since our hypothesis is that priming and feedback lead to better proposal generation, we also evaluate the recall of region proposals from the RPN modules of various models, at different IOU thresholds. In Fig. 4, we show the results of using 2000 proposals per RPN module. Since feedback models have 2 units, we report their numbers with both 4000 and the top 2000 proposals (sorted by cls score). As can be seen, the priming, feedback and joint models all lead to higher average recall (shown in the legend) over the baseline RPN module.
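The recall-vs-IoU evaluation can be sketched as follows (illustrative NumPy code with an assumed box format; ‘average recall’ is taken here as the mean recall over the IoU thresholds):

```python
# Illustrative recall-vs-IoU: a ground-truth box counts as recalled at threshold t
# if some proposal overlaps it with IoU >= t.
import numpy as np

def iou_matrix(props, gts):
    """props: (N, 4), gts: (M, 4), boxes as (x1, y1, x2, y2). Returns (N, M) IoUs."""
    x1 = np.maximum(props[:, None, 0], gts[None, :, 0])
    y1 = np.maximum(props[:, None, 1], gts[None, :, 1])
    x2 = np.minimum(props[:, None, 2], gts[None, :, 2])
    y2 = np.minimum(props[:, None, 3], gts[None, :, 3])
    inter = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)
    area_p = (props[:, 2] - props[:, 0]) * (props[:, 3] - props[:, 1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_p[:, None] + area_g[None, :] - inter)

def recall_at_iou(props, gts, thresholds=np.arange(0.5, 1.0, 0.05)):
    best = iou_matrix(props, gts).max(axis=0)        # best proposal IoU per GT box
    return [(t, float((best >= t).mean())) for t in thresholds]

xy = np.random.rand(2000, 2) * 400                   # 2000 synthetic proposals
wh = np.random.rand(2000, 2) * 200 + 10
props = np.hstack([xy, xy + wh])
gts = np.array([[50, 60, 200, 220], [300, 100, 450, 260]], dtype=float)
for t, r in recall_at_iou(props, gts):
    print(f"IoU {t:.2f}: recall {r:.2f}")
# Average recall is the mean of these values across thresholds.
```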

Fig. 4. Recall-to-IoU on VOC12 Segmentation val set (left) and VOC07 test set (right) (best viewed digitally).

Table 8. Detection results on COCO minival5k set.

MS COCO Results. We also perform additional analysis of contextual priming on the COCO [50] dataset. Our priming model results in \(+\mathbf{1.2 }\) AP points (\(+2.1\) AP50) over Faster R-CNN and \(+\mathbf{0.8 }\) AP points (\(+1.1\) AP50) over Base-MT on the COCO minival5k set [5, 28]. On further analysis, we notice that most of the performance gains are for objects where context should intuitively help; e.g., \(+12.4\) for ‘parking-meter’, \(+8.7\) for ‘suitcase’, \(+8.3\) for ‘umbrella’ etc. on AP50 w.r.t. Faster R-CNN. In fact, we consistently see \({>}{} \mathbf 3 \) points improvement over Base-MT (\({>}{} \mathbf 5 \) points over Faster R-CNN) in AP50 for the top-20 improved objects (Table 8).

7 Conclusion

We presented and investigated how to incorporate top-down semantic feedback in the state-of-the-art Faster R-CNN framework. We proposed to augment Faster R-CNN with a segmentation network, which is then used to provide top-down contextual feedback to the region proposal generation and object detection modules. We also used this segmentation network to provide top-down feedback to the entire Faster R-CNN network iteratively. Our results demonstrate the effectiveness of these top-down feedback mechanisms for the tasks of region proposal generation, object detection and semantic segmentation.