
1 Introduction

Semantic segmentation is a fundamental task in computer vision, whose goal is to associate a semantic object category with each pixel in an image [1,2,3,4]. Many real-world applications, e.g., autonomous driving [4], medical analysis [5], and computational photography [6], can benefit from accurate semantic segmentation that provides detailed information about the content of an image.

In recent years, we have witnessed tremendous progress in semantic segmentation accuracy. These advances are largely driven by the power of fully convolutional networks (FCNs) [7] and their derivatives [8, 9], which are pre-trained on large-scale datasets [2, 10]. It has also become apparent that accounting for semantic context leads to more accurate segmentation of individual objects [9, 11,12,13,14,15,16,17,18,19].

The feature maps extracted by the deeper layers of a convolutional network encode higher-level semantic information and context contained in the large receptive field of each neuron. In contrast, the shallower layers encode appearance and location. State-of-the-art semantic segmentation approaches propagate coarse semantic context information back to the shallow layers, yielding richer features, and more accurate segmentations [7, 9, 17,18,19,20,21]. However, in these methods, context is typically propagated along the feature hierarchy in a one-directional manner, as illustrated in Fig. 1(a) and (b).

Fig. 1. Alternative approaches for encoding multi-scale context information into segmentation features for per-pixel prediction. The spatial pyramid pooling (SPP) network (a) and the encoder-decoder (ED) network (b) propagate information across the hierarchy in a one-directional fashion. In contrast, our multi-scale context intertwining architecture (c) exchanges information between adjacent scales in a bidirectional fashion, and hierarchically combines the resulting feature maps. Figure 2 provides a more detailed illustration of the multi-stage recurrent context intertwining process.

In this paper, we advocate the idea that more powerful features can be learned by enabling context to be exchanged between scales in a bidirectional manner. We refer to such information exchange as context intertwining. The intuition here is that semantics and context of adjacent scales are strongly correlated, and hence the descriptive power of the features may be significantly enhanced by such intertwining, leading to more precise semantic labeling.

Our approach is illustrated by the diagram in Fig. 1(c). Starting from a collection of multi-scale convolutional feature maps, each pair of successive feature maps is intertwined together to yield a new enriched feature map. The intertwining is modeled using two chains of long short-term memory (LSTM) units [22], which repeatedly exchange information between them, in a bidirectional fashion, as shown in Fig. 2. Each intertwining phase reduces the number of feature maps by one, resulting in a hierarchical feature combination scheme (the horizontal hierarchy in Fig. 1(c)). Eventually, a single enriched high-resolution feature map remains, which is then used for per-pixel semantic label inference.

Furthermore, rather than using fixed information propagation routes for context aggregation, we subdivide images into super-pixels, and use the spatial relationships between the super-pixels to define image-adapted feature connections.

We demonstrate the effectiveness of our approach by evaluating it and comparing it to an array of state-of-the-art semantic segmentation methods on four public datasets (PASCAL VOC 2012 [1], PASCAL-Context [3], NYUDv2 [23], and SUN-RGBD [24]). On the PASCAL VOC 2012 validation set, we outperform the state-of-the-art (with 85.1% mean IoU). On the PASCAL VOC 2012 test set, our performance (88.0% mean IoU) is second only to the recent result of Chen et al. [25], who use a backbone network trained on an internal JFT dataset [26,27,28], whereas our backbone network is trained on the ImageNet dataset [10].

2 Related Work

Fully convolutional networks (FCNs) [7] have proved effective for semantic image segmentation by leveraging the powerful convolutional features of classification networks [27, 29, 30] pre-trained on large-scale data [10, 24]. The feature maps extracted by the different convolutional layers have progressively coarser spatial resolutions, and their neurons correspond to progressively larger receptive fields in the image space. Thus, the collection of feature maps of different resolutions encodes multi-scale context information. Semantic segmentation methods attempt to exploit this multi-scale context information for accurate segmentation. In this paper, we focus on two aspects, namely Feature Combination and Feature Connection, which most recent works [7, 9, 18,19,20, 31,32,33] have also explored to make better use of the image context.

Feature Combination. To capture multi-scale context information in the segmentation features, many works combine feature maps whose neurons have different receptive fields, and various combination schemes have been proposed. Spatial pyramid pooling (SPP) [34] has been successfully applied for combining different convolutional feature maps [9, 18, 20]. Generally, the last convolutional feature map, which is fed to the pixel-wise classifier, is equipped with an SPP (see Fig. 1(a)). However, the SPP-enriched feature maps lack the fine detail that is lost by the down-sampling operations of an FCN. Although atrous convolution can preserve the resolution of feature maps and thus retain more detail, it requires a large amount of GPU memory [27, 29, 30]. To save GPU memory and improve segmentation performance, some networks [17, 19, 21, 35] utilize an Encoder-Decoder (ED) network to gradually combine adjacent feature maps along the top-down hierarchy of a common FCN architecture, propagating semantic information from the low-resolution feature maps to the high-resolution ones and using the high-resolution feature maps to recover the details of objects (see Fig. 1(b)). The latest work [25] further uses the ED network along with atrous spatial pyramid pooling (ASPP) [20], and combines multi-resolution feature maps for information enrichment. In the ED network, each feature map of the decoder part only directly receives information from the feature map at the same level of the encoder part. The strongly-correlated semantic information provided by the adjacent lower-resolution feature map of the encoder part, however, has to pass through additional intermediate layers to reach the same decoder layer, which may result in information decay.

In contrast, our approach directly combines pairs of adjacent feature maps in the deep network hierarchy. It creates new feature maps that directly receive the semantic information and context from a lower-resolution feature map and the improved spatial detail from a higher-resolution feature map. In addition, in our architecture the information exchange between feature maps is recurrent and bidirectional, enabling better feature learning. The pairwise bidirectional connections produce a second, horizontal hierarchy of the resulting feature maps, leading up to a full resolution context-enriched feature map (rightmost feature map in Fig. 1(c)), which is used for pixel-wise label prediction.

Feature Connection. Connections between feature maps enable communication between neurons with different receptive field sizes, yielding new feature maps that encode multi-scale context information. FCN-based models [7,8,9, 17,18,19,20, 31] generally use separate neurons to represent regular regions of an image. Normally, they use convolutional/pooling kernels with predefined shapes to aggregate the information of adjacent neurons, and propagate this information to the neurons of other feature maps. However, traditional convolutional/pooling kernels only capture context information at a local scale. To leverage richer context information, graphical models have been integrated with FCNs [12, 13, 16]. Graphical models build dense connections between feature maps, allowing neurons to be more sensitive to the global image content that is critical for learning good segmentation features. Note that previous works use one-way connections that extract context information from the feature maps separately, before eventually combining it. Thus, the features learned at a given scale are not given the opportunity to optimally account for the multi-scale context information from all of the other scales.

In contrast to previous methods, our bidirectional connections exchange multi-scale context information to improve the learning of all features. We employ super-pixels computed from the image structure, and use the relationships between them to define the exchange routes between neurons in different feature maps. This enables more adaptive context information propagation. Several previous works [31,32,33, 36] also use super-pixels to define feature connections, and information exchange has been studied for object detection [37, 38]. However, these works do not exchange information between feature maps of different resolutions, which is critical for semantic segmentation.

3 Multi-scale Context Intertwining

To utilize multi-scale context information, most existing networks use one-way connections to combine feature maps of different resolutions, following the top-down order of the network hierarchy (see Fig. 1(a) and (b)). Here, we present a multi-scale context intertwining (MSCI) architecture, where the context information is propagated along two dimensions. The first dimension is along the vertical deep hierarchy (see Fig. 1(c)): our context intertwining scheme uses connections to exchange multi-scale context information between adjacent feature maps. Each connection is bidirectional, consisting of two different long short-term memory (LSTM) chains [22] that intertwine feature maps of different resolutions over a sequence of stages. By training the LSTM units, the bidirectional connections learn to produce more powerful feature maps. The second dimension is along the horizontal hierarchy: the feature maps produced by our bidirectional connections are fed to the next phase of context intertwining, which encodes the context information memorized by the bidirectional connections into the new feature maps.

Fig. 2. Multi-scale context intertwining between two successive feature maps in the deep hierarchy. The green arrows propagate the context information from the lower-resolution feature map to the higher-resolution one. Conversely, the blue arrows forward information from the higher-resolution feature map to augment the lower-resolution one. The orange circle in each stage indicates the hidden features output by LSTMs, including the cell states and gates. (Color figure online)

The overall MSCI architecture is illustrated in Fig. 1(c). Initially, we use the backbone FCN to compute a set \(\{F^l\}\) of convolutional feature maps of different resolutions, where \(l=1,...,L\) and \(F^{1}\) has the highest resolution. Figure 2 provides a more detailed view of the context intertwining between two successive feature maps \(F^l\) and \(F^{l+1}\). To exchange context information between \(F^l\) and \(F^{l+1}\), we construct a bidirectional connection \(\mathcal {L}\):

$$\begin{aligned} \{Q^{l}, C^{l \rightarrow l+1}_T, C^{l+1 \rightarrow l}_T\} = \mathcal {L}(F^l, F^{l+1}, C^{l \rightarrow l+1}, C^{l+1 \rightarrow l}, P^{l \rightarrow l+1}, P^{l+1 \rightarrow l}, T). \end{aligned}$$
(1)

The bidirectional connection \(\mathcal {L}\) consists of two different LSTM chains. One chain has the parameter set \(P^{l \rightarrow l+1}\); it extracts context information from \(F^l\) and passes it to \(F^{l+1}\). The other chain has the parameter set \(P^{l+1 \rightarrow l}\) and passes context information from \(F^{l+1}\) to \(F^l\). \(C^{l \rightarrow l+1}\) and \(C^{l+1 \rightarrow l}\) are the cell states of the two LSTMs, and they are initialized to zero at the beginning. As shown in Fig. 2, the information exchange takes place over T stages. At each stage t, information is exchanged between the feature maps \(F^l_t\) and \(F^{l+1}_t\), yielding the maps \(F^l_{t+1}\) and \(F^{l+1}_{t+1}\). Note that the resulting feature map \(F^{l}_T\) has a higher resolution than \(F^{l+1}_T\). Thus, we deconvolve the feature map \(F^{l+1}_T\) with the kernel \(D^{l+1}_{f}\) and add it to \(F^{l}_T\) to obtain a combined high-resolution feature map \(Q^{l}\):

$$\begin{aligned} Q^{l} = F^{l}_T + D^{l+1}_{f} *F^{l+1}_T. \end{aligned}$$
(2)

Note that the feature map \(Q^{l}\) and the cell states \(C^{l \rightarrow l+1}_T\) and \(C^{l+1 \rightarrow l}_T\) can be further employed to drive the next phase of context intertwining (the next level of the horizontal hierarchy). Along the LSTM chains, the feature maps contain neurons with increasingly large receptive fields, i.e., with richer global context. Moreover, the cell states of the LSTMs memorize the context information exchanged at the different stages. Due to the shortcut design of the cell states [22], the local context from the early stages can easily propagate to the last stage, encoding multi-scale context, including both local and global information, into the final feature map.
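As a concrete illustration, the following is a minimal NumPy sketch of one such intertwining phase, i.e., of Eqs. (1) and (2). The `exchange_step` callable (one bidirectional LSTM stage, detailed in Sect. 4) and the nearest-neighbour `upsample` (a crude stand-in for the learned deconvolution kernel \(D^{l+1}_f\)) are simplifying assumptions of this sketch, not the paper's exact implementation.

```python
import numpy as np

def upsample(x, factor=2):
    """Nearest-neighbour upsampling; a stand-in for the learned
    deconvolution with kernel D^{l+1}_f in Eq. (2)."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def bidirectional_connection(F_hi, F_lo, exchange_step, T=3):
    """One context-intertwining phase, Eqs. (1)-(2) (sketch).

    F_hi: higher-resolution feature map F^l,     shape (H, W, C)
    F_lo: lower-resolution  feature map F^{l+1}, shape (H/2, W/2, C)
    exchange_step: callable performing one bidirectional LSTM stage
                   (Eq. (5) in both directions); returns updated maps
                   and cell states.
    """
    C_up = np.zeros_like(F_lo)    # cell state C^{l -> l+1}
    C_down = np.zeros_like(F_hi)  # cell state C^{l+1 -> l}
    for _ in range(T):            # T exchange stages (Fig. 2)
        F_hi, F_lo, C_down, C_up = exchange_step(F_hi, F_lo, C_down, C_up)
    # Eq. (2): lift F^{l+1}_T to the higher resolution and add it to F^l_T.
    Q = F_hi + upsample(F_lo)
    return Q, C_up, C_down
```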

Algorithm 1.

The entire MSCI process is summarized in Algorithm 1. We assume the MSCI process has K phases in total, each of which produces new feature maps. As each pair of feature maps is intertwined, the corresponding cell states \((C^{l \rightarrow l+1}, C^{l+1 \rightarrow l})\) are iteratively updated to provide the memorized context that assists the information exchange in the next phase. Finally, the output is the high-resolution feature map \(Q^{1}_K\), which is fed to the pixel-wise classifier for segmentation. Algorithm 1 describes the feed-forward pass through the LSTMs. We remark that the LSTM parameters are reusable, and the LSTMs are trained using the standard stochastic gradient descent (SGD) algorithm with back-propagation. Below, we focus on a single context intertwining phase, and thus omit the subscript k to simplify notation.
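Since Algorithm 1 itself is not reproduced here, the following Python sketch reconstructs the feed-forward pass it summarizes, under the assumption that each phase intertwines every adjacent pair of the current feature maps and so reduces their number by one, as in Fig. 1(c). The `bidirectional_connection` argument is a phase function like the one sketched after Eq. (2), with its `exchange_step` already bound (e.g. via `functools.partial`); for brevity this sketch drops the cell states between phases, whereas Algorithm 1 carries them forward.

```python
def msci_forward(features, bidirectional_connection, T=3):
    """Reconstructed feed-forward MSCI pass (cf. Algorithm 1).

    features: [F^1_0, ..., F^L_0], ordered from highest to lowest
              resolution (e.g. taken from res2..res5 of the backbone).
    Returns the single high-resolution, context-enriched map Q^1_K
    that is fed to the pixel-wise classifier.
    """
    maps = list(features)
    while len(maps) > 1:                    # K = L - 1 phases in total
        next_maps = []
        for l in range(len(maps) - 1):      # intertwine adjacent pairs
            Q, C_up, C_down = bidirectional_connection(
                F_hi=maps[l], F_lo=maps[l + 1], T=T)
            next_maps.append(Q)             # cell states dropped here;
        maps = next_maps                    # Algorithm 1 reuses them
    return maps[0]                          # Q^1_K
```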

4 Bidirectional Connection

In this section, we describe in more detail the bidirectional connections that enable mutual exchange of context information between low- and high-resolution feature maps. Our bidirectional connections are guided by the super-pixel structure of the original image, as illustrated in Fig. 3. Given an input image I, we divide it into non-overlapping super-pixels, which correspond to a set of regions \(\{S_n\}\). Let \(F^{l}_t\) and \(F^{l+1}_t\) denote two feature maps of adjacent resolutions in our network, where l is the resolution level and t is the LSTM stage. The context information exchange between \(F^{l}_t\) and \(F^{l+1}_t\) is conducted over the regions defined by the super-pixels. Informally, at each of the two levels, for each region \(S_n\) we first aggregate the neurons whose receptive fields are centered inside \(S_n\). Next, we sum together the aggregated features of \(S_n\) and all of its neighboring regions at one level, and pass the resulting context information to the neurons of the other level that reside in region \(S_n\). This is done in both directions, as shown in Fig. 3(a) and (b). Thus, we enrich the locally aggregated context information of each neuron with that of its counterpart at the other level, as well as with the more global context aggregated from the surrounding regions. Our results show that this significantly improves segmentation accuracy.

Fig. 3. Bidirectional context aggregation. The features are partitioned into regions defined by super-pixels. We aggregate the neurons residing in the same region, and pass the information of the adjacent regions along the bidirectional connection (a) from a low-resolution feature map to a high-resolution one; and (b) from a high-resolution feature map to a low-resolution one.

Formally, given the feature map \(F^{l}_t\) and a region \(S_n\), we first aggregate the neurons in \(S_n\), yielding a regional context feature \(R^{l}_{n, t} \in \mathbb {R}^{C}\):

$$\begin{aligned} R^{l}_{n, t} = \sum _{(h, w) \in \varPhi (S_n)} F^{l}_{t}(h, w), \end{aligned}$$
(3)

where \(\varPhi (S_n)\) denotes the set of centers of the receptive fields inside the region \(S_n\). Next, we define a more global context feature \(M^{l}_{n, t}\), by aggregating the regional features of \(S_n\) and of its adjacent regions \(\mathcal {N}(S_n)\):

$$\begin{aligned} M^{l}_{n, t} = \sum _{S_m \in \mathcal {N}(S_n)} R^{l}_{m, t}. \end{aligned}$$
(4)
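To make Eqs. (3) and (4) concrete, the sketch below aggregates a feature map over super-pixel regions with a scatter-add, and then sums each region's feature with those of its neighbours. The super-pixel label map and the pixel-grid adjacency test are bookkeeping assumptions of this sketch; the paper defines \(\mathcal {N}(S_n)\) on the super-pixel graph but does not spell out the data structures.

```python
import numpy as np

def regional_context(F, labels):
    """Eq. (3): R[n] = sum of the neurons whose receptive-field centres
    fall inside super-pixel S_n.

    F:      feature map, shape (H, W, C)
    labels: super-pixel index of each neuron, shape (H, W), values in [0, N)
    """
    C = F.shape[-1]
    N = int(labels.max()) + 1
    R = np.zeros((N, C), dtype=F.dtype)
    np.add.at(R, labels.reshape(-1), F.reshape(-1, C))   # scatter-add
    return R

def neighbourhood_context(R, labels):
    """Eq. (4): M[n] = sum of R[m] over the regions S_m adjacent to S_n.
    (The text of Sect. 4 suggests S_n itself may also be included;
    that would amount to adding R to the result.)"""
    N = R.shape[0]
    adj = np.zeros((N, N), dtype=bool)
    a, b = labels[:, :-1], labels[:, 1:]                 # horizontal pairs
    adj[a[a != b], b[a != b]] = True
    a, b = labels[:-1, :], labels[1:, :]                 # vertical pairs
    adj[a[a != b], b[a != b]] = True
    adj |= adj.T                                         # symmetric adjacency
    return adj.astype(R.dtype) @ R
```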

The above features are propagated bidirectionally between \(F^{l}_t\) and \(F^{l+1}_t\) using a pair of LSTM chains, as illustrated in Fig. 2. In the \(t^{th}\) stage, an LSTM unit generates a new feature \(F^{l+1}_{t+1}\) from \(F^{l+1}_{t}\), \(R^{l}_{n, t}\), and \(M^{l}_{n, t}\), as follows:

$$\begin{aligned}&G^{l \rightarrow l+1}_{i, t}(h, w) = \sigma (W^{l+1}_{i} *F^{l+1}_t(h, w) + W^{l}_{s, i} *R^{l}_{n, t} + W^{l}_{a, i} *M^{l}_{n, t}+ b^{l+1}_i), \nonumber \\&G^{l \rightarrow l+1}_{f, t}(h, w) = \sigma (W^{l+1}_{f} *F^{l+1}_t(h, w) + W^{l}_{s, f} *R^{l}_{n, t} + W^{l}_{a, f} *M^{l}_{n, t}+ b^{l+1}_f), \nonumber \\&G^{l \rightarrow l+1}_{o, t}(h, w) = \sigma (W^{l+1}_{o} *F^{l+1}_t(h, w) + W^{l}_{s, o} *R^{l}_{n, t} + W^{l}_{a, o} *M^{l}_{n, t}+ b^{l+1}_o), \nonumber \\&G^{l \rightarrow l+1}_{c, t}(h, w) = \tanh (W^{l+1}_{c} *F^{l+1}_t(h, w) + W^{l}_{s, c} *R^{l}_{n, t} + W^{l}_{a, c} *M^{l}_{n, t}+ b^{l+1}_c), \nonumber \\&C^{l \rightarrow l+1}_{t+1}(h, w) = G^{l \rightarrow l+1}_{f, t}(h, w) \odot C^{l \rightarrow l+1}_{t}(h, w) + G^{l \rightarrow l+1}_{i, t}(h, w) \odot G^{l \rightarrow l+1}_{c, t}(h, w), \nonumber \\&~~~~~~~~A^{l \rightarrow l+1}_{t+1}(h, w) = \tanh (G^{l \rightarrow l+1}_{o, t}(h, w) \odot C^{l \rightarrow l+1}_{t+1}(h, w)), \nonumber \\&~~~~~~~~F^{l+1}_{t+1}(h, w) = F^{l+1}_t(h, w) + A^{l \rightarrow l+1}_{t+1}(h, w), \end{aligned}$$
(5)

where \((h, w) \in \varPhi (S_n)\), and W and b denote convolutional kernels and biases. In Eq. (5), convolutions are denoted by \(*\), while \(\odot \) denotes the Hadamard product. G and C represent the gates and cell state of an LSTM unit, respectively. \(A^{l \rightarrow l+1}_{t+1}\) is the augmentation feature for \(F^{l+1}_t\), and the two have equal resolution. We add the augmentation feature \(A^{l \rightarrow l+1}_{t+1}\) to \(F^{l+1}_t\), producing the new feature \(F^{l+1}_{t+1}\) for the next stage. The sequence of features \(F^{l}_t\) is defined in the same way as above (with the l superscripts replaced by \(l+1\), and vice versa).
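For reference, a minimal NumPy rendering of one direction of Eq. (5) is given below. To keep the sketch short, the learned convolutional kernels are reduced to \(1 \times 1\) convolutions (plain per-neuron matrix multiplications), and the parameter layout in `params` is an assumption of this sketch rather than the paper's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_exchange_step(F, R, M, C_cell, labels, params):
    """One direction of Eq. (5): update the target map F^{l+1}_t from the
    regional (R) and neighbourhood (M) context of the source level l.

    F:      target feature map F^{l+1}_t,      shape (H, W, C)
    R, M:   per-region context features,       shape (N, C) each
    C_cell: cell state C^{l -> l+1}_t,         shape (H, W, C)
    labels: super-pixel index of each neuron,  shape (H, W)
    params: per-gate dicts with 'W_feat', 'W_reg', 'W_nbr' of shape (C, C)
            and 'b' of shape (C,); kernels reduced to 1x1 convolutions here.
    """
    Rn, Mn = R[labels], M[labels]             # broadcast R^l_{n,t}, M^l_{n,t}
    def gate(name, act):
        p = params[name]
        return act(F @ p['W_feat'] + Rn @ p['W_reg'] + Mn @ p['W_nbr'] + p['b'])
    G_i = gate('input',  sigmoid)
    G_f = gate('forget', sigmoid)
    G_o = gate('output', sigmoid)
    G_c = gate('cell',   np.tanh)
    C_next = G_f * C_cell + G_i * G_c         # new cell state C^{l->l+1}_{t+1}
    A = np.tanh(G_o * C_next)                 # augmentation feature A_{t+1}
    return F + A, C_next                      # F^{l+1}_{t+1}, updated cell
```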

5 Implementation Details

We use the Caffe platform [39] to implement our approach. Our approach can be based on different deep architectures [29, 30, 40]; we use the ResNet-152 architecture [30] pre-trained on the ImageNet dataset [10] as our backbone network. We randomly initialize the parameters of our LSTM-based bidirectional connections. Before training our network for the evaluations on the different benchmarks, we follow [17, 18, 20, 25] and use the COCO dataset [2] to fine-tune the whole network.

Given an input image, we apply the structured edge detection toolbox [41] to compute super-pixels. Empirically, we set the scale of super-pixels to be 1,000 per image. The image is fed to the backbone network to compute the convolutional features. Following [31], we select the last convolutional feature map from each residual block as the initial feature maps fed into our context intertwining network. More specifically, we use the ResNet-152 network layers res2, res3, res4 and res5 as \(\left\{ F^1_0, F^2_0, F^3_0, F^4_0 \right\} \), respectively. Successive pairs of these feature maps are fed into our LSTM-based context intertwining modules, each of which has 3 bidirectional exchange stages. We optimize the segmentation network using the standard SGD solver. We fine-tune the parameters of the backbone network and the bidirectional connections.

During training, we augment the training data with common flipping, cropping, scaling, and rotation transformations. The network is fine-tuned with a learning rate of 1e\(-3\) for 60K mini-batches. After that, we decay the learning rate to 1e\(-4\) for the next 60K mini-batches. The size of each mini-batch is set to 12. With the trained model, we perform multi-scale testing on each image to obtain the segmentation result. That is, we rescale each testing image using five factors (i.e., \(\{0.4, 0.6, 0.8, 1.0, 1.2\}\)) and feed the differently scaled versions into the network to obtain predictions. The predictions are averaged to yield the final result.
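As a usage illustration of the multi-scale testing described above, here is a hedged sketch. `predict_fn` stands for the trained network (in whatever framework it lives), and `scipy.ndimage.zoom` is used merely as a convenient resampler; neither is specified by the paper.

```python
import numpy as np
from scipy.ndimage import zoom

def multi_scale_predict(image, predict_fn, scales=(0.4, 0.6, 0.8, 1.0, 1.2)):
    """Multi-scale testing: run the network on rescaled copies of the
    image, resize the class-score maps back to the original resolution,
    and average them before taking the per-pixel argmax.

    image:      float array of shape (H, W, 3)
    predict_fn: callable mapping an image to per-pixel class scores
                of shape (h, w, num_classes).
    """
    H, W = image.shape[:2]
    accum = None
    for s in scales:
        scaled = zoom(image, (s, s, 1), order=1)          # rescale input
        scores = predict_fn(scaled)
        scores = zoom(scores, (H / scores.shape[0],       # back to (H, W)
                               W / scores.shape[1], 1), order=1)
        accum = scores if accum is None else accum + scores
    return np.argmax(accum / len(scales), axis=-1)        # per-pixel labels
```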

6 Experiments

We evaluate our approach on four public benchmarks for semantic segmentation: the PASCAL VOC 2012 [1], PASCAL-Context [3], NYUDv2 [23], and SUN-RGBD [24] datasets. The PASCAL VOC 2012 dataset [1] has been widely used for evaluating segmentation performance. It contains 10,582 training images along with pixel-wise annotations for 20 object classes and the background. The PASCAL VOC 2012 dataset also provides a validation set of 1,449 images and a test set of 1,456 images. We use this dataset for the main evaluation of our network. We further use the PASCAL-Context, NYUDv2, and SUN-RGBD datasets for extensive comparisons with state-of-the-art methods. We report all segmentation scores in terms of mean Intersection-over-Union (IoU).

Ablation Study of MSCI. Our MSCI architecture is designed to enable exchange of multi-scale context information between feature maps. It consists of recurrent bidirectional connections defined using super-pixels. Below, we report an ablation study of our approach, which examines the effect that removing various key components has on segmentation performance. The results are summarized in Table 1.

Our approach is based on LSTMs, each of which can be regarded as a special recurrent neural network (RNN) unit with a cell state for memorization. By removing the RNNs and the cell states, we effectively disable the bidirectional connections between feature maps. In this case, our model degrades to a basic FCN and obtains a segmentation score of 77.8%, which lags far behind our full MSCI model.

Table 1. Ablation experiments on the PASCAL VOC 2012 validation set. Segmentation accuracy is reported in terms of mean IoU (%).
Table 2. Comparison of different feature combination strategies. Performance is evaluated on the PASCAL VOC 2012 and PASCAL-Context validation sets. Segmentation accuracy is reported in terms of mean IoU (%).

Next, we investigate the importance of the cell states. The cell states are employed by our approach to memorize the local and global context information, which enriches the final segmentation feature map. With all the cell states removed from our bidirectional connections, our approach achieves an accuracy of \(84.4\%\), which is significantly lower than the \(85.1\%\) accuracy of our full approach.

In our approach, the super-pixels adaptively partition the features into different regions according to the image structure, which are then used for context aggregation and exchange (Fig. 3). We remove the super-pixels and instead interpolate the low-resolution feature maps [17, 18] to match the high-resolution maps, so that each neuron aggregates context from a local regular window. Compared to our full model, the performance drops to \(84.3\%\), demonstrating the effectiveness of using super-pixels to guide context aggregation.

Feature Combination Strategies. Our approach combines the features produced by the bidirectional connections in a hierarchical manner. In Table 2, we compare our feature combination strategy to those of other networks [9, 17, 18, 20, 25]. For a fair comparison, we reproduce the compared networks by pre-training them with the ResNet-152 backbone model on the ImageNet dataset, and fine-tuning them on the COCO dataset and the PASCAL VOC 2012 training set. Without any combination of features, the backbone FCN model achieves a score of \(77.8\%\). Next, we compare our network to the SPP [9, 18, 20] and Encoder-Decoder [17, 19, 21, 25, 35] networks. For the SPP network, we choose the state-of-the-art model proposed in [18] for comparison. The ASPP network [20] is a variant of the SPP network and achieves better results than the SPP network. For the Encoder-Decoder network, we select the model proposed in [17] for comparison. We also compare our network with the latest Encoder-Decoder network with ASPP components [25]. When these models combine adjacent features that have been learned with our bidirectional connections, their segmentation scores generally improve by 0.4–1.2 points over their counterparts without bidirectional connections. Overall, our approach performs better than the other methods. In Fig. 4, we can also observe that MSCI provides better visual results than the other methods.

Table 3. Comparisons with other state-of-the-art methods. Performance is evaluated on the PASCAL VOC 2012 validation set (left) and test set (right). Segmentation accuracy is reported in terms of mean IoU (%).
Fig. 4. The segmentation results of the ASPP model [20], the Encoder-Decoder with ASPP model [25], and our MSCI. The images are taken from the PASCAL VOC 2012 validation set.

Fig. 5. The segmentation results of the ASPP model [20], the Encoder-Decoder with ASPP model [25], and our MSCI. The images are scenes taken from the PASCAL-Context validation set.

Comparisons with State-of-the-Art Methods. In Table 3, we report the results of our approach on the PASCAL VOC 2012 validation and test sets, and compare with state-of-the-art methods. On the validation set (see Table 3(left)), MSCI achieves a better result than all other methods. Specifically, given the same set of training images, it outperforms the models proposed in [17, 18, 20], which are based on the SPP, ASPP and Encoder-Decoder networks, respectively. In addition, we also report our result on the test set; our per-category results on the test set can be found on the PASCAL VOC leaderboard. Our result of \(88.0\%\) is second only to the score reported in [25], which leverages a stronger backbone network, trained on an internal JFT-300M dataset [26,27,28].

Experiments on Scene Labeling Datasets. We perform additional experiments on three scene labeling datasets: PASCAL-Context [3], NYUDv2 [23], and SUN-RGBD [24]. In contrast to the object-centric PASCAL VOC 2012 dataset, these scene labeling datasets provide more complex pixel-wise annotations of objects and stuff, which require segmentation networks to reason fully about the scene in an image. We use these datasets to verify whether our network can label scene images well.

The PASCAL-Context dataset [3] contains 59 categories and background, providing 4,998 images for training and 5,105 images for validation. In Table 2, we already used this dataset to compare MSCI to other feature combination strategies, and found that it works well on the scene labeling task. We provide several segmentation results in Fig. 5. Table 4 shows that MSCI outperforms other state-of-the-art methods on this dataset.

Table 4. Comparison with other state-of-the-art methods. Performance is evaluated on the PASCAL-Context validation set (left), NYUDv2 validation set (middle) and the SUN-RGBD validation set (right). Segmentation accuracy is reported in terms of mean IoU (%).

We further evaluate our method on the NYUDv2 [23] and SUN-RGBD [24] datasets, originally intended for RGB-D scene labeling. The NYUDv2 dataset [23] has 1,449 images (795 training images and 654 testing images) with pixel-wise annotations of 40 categories. The SUN-RGBD dataset [24] has 10,335 images (5,285 training images and 5,050 testing images) with pixel-wise annotations of 37 categories. Unlike the PASCAL-Context dataset, the NYUDv2 and SUN-RGBD datasets consist of images of indoor scenes. We report the segmentation scores of MSCI and other state-of-the-art methods in Table 4. We note that the best previous method [55] uses RGB and depth information jointly for segmentation, and achieves scores of 47.7% and 48.1% on the NYUDv2 and SUN-RGBD validation sets, respectively. Even without the depth information, MSCI outperforms these previous best results. We show some of our segmentation results on the NYUDv2 and SUN-RGBD validation sets in Fig. 6.

Fig. 6. MSCI segmentation results. The images are taken from the NYUDv2 validation set (left) and the SUN-RGBD validation set (right).

7 Conclusions

Recent progress in semantic segmentation may be attributed to powerful deep convolutional features and the joint consideration of local and global context information. In this work, we have proposed a novel approach for connecting and combining feature maps and context from multiple scales. Our approach uses interconnected LSTM chains in order to effectively exchange information among feature maps corresponding to adjacent scales. The enriched maps are hierarchically combined to produce a high-resolution feature map for pixel-level semantic inference. We have demonstrated that our approach is effective and outperforms the state-of-the-art on several public benchmarks.

In the future, we plan to apply our MSCI approach to stronger backbone networks and more large-scale datasets for training. In addition, we aim to extend MSCI to other recognition tasks, such as object detection and 3D scene understanding.