
1 Introduction

Semantic segmentation is a fundamental task in computer vision, whose goal is to associate a semantic object category with each pixel in an image [1,2,3,4]. Many real-world applications, e.g., autonomous driving [4], medical analysis [5], and computational photography [6], can benefit from accurate semantic segmentation that provides detailed information about the content of an image.

In recent years, we have witnessed tremendous progress in semantic segmentation accuracy. These advances are largely driven by the power of fully convolutional networks (FCNs) [7] and their derivatives [8, 9], which are pre-trained on large-scale datasets [2, 10]. It has also become apparent that accounting for semantic context leads to more accurate segmentation of individual objects [9, 11,12,13,14,15,16,17,18,19].

The feature maps extracted by the deeper layers of a convolutional network encode higher-level semantic information and context contained in the large receptive field of each neuron. In contrast, the shallower layers encode appearance and location. State-of-the-art semantic segmentation approaches propagate coarse semantic context information back to the shallow layers, yielding richer features, and more accurate segmentations [7, 9, 17,18,19,20,21]. However, in these methods, context is typically propagated along the feature hierarchy in a one-directional manner, as illustrated in Fig. 1(a) and (b).

Fig. 1. Alternative approaches for encoding multi-scale context information into segmentation features for per-pixel prediction. The spatial pyramid pooling (SPP) network (a) and the encoder-decoder (ED) network (b) propagate information across the hierarchy in a one-directional fashion. In contrast, our multi-scale context intertwining architecture (c) exchanges information between adjacent scales in a bidirectional fashion, and hierarchically combines the resulting feature maps. Figure 2 provides a more detailed illustration of the multi-stage recurrent context intertwining process.

In this paper, we advocate the idea that more powerful features can be learned by enabling context to be exchanged between scales in a bidirectional manner. We refer to such information exchange as context intertwining. The intuition here is that semantics and context of adjacent scales are strongly correlated, and hence the descriptive power of the features may be significantly enhanced by such intertwining, leading to more precise semantic labeling.

Our approach is illustrated by the diagram in Fig. 1(c). Starting from a collection of multi-scale convolutional feature maps, each pair of successive feature maps is intertwined together to yield a new enriched feature map. The intertwining is modeled using two chains of long short-term memory (LSTM) units [22], which repeatedly exchange information between them, in a bidirectional fashion, as shown in Fig. 2. Each intertwining phase reduces the number of feature maps by one, resulting in a hierarchical feature combination scheme (the horizontal hierarchy in Fig. 1(c)). Eventually, a single enriched high-resolution feature map remains, which is then used for per-pixel semantic label inference.

Furthermore, rather than using fixed information propagation routes for context aggregation, we subdivide images into super-pixels, and use the spatial relationships between the super-pixels to define image-adapted feature connections.

We demonstrate the effectiveness of our approach by evaluating it and comparing it to an array of state-of-the-art semantic segmentation methods on four public datasets (PASCAL VOC 2012 [1], PASCAL-Context [3], NYUDv2 [23], and SUN-RGBD [24]). On the PASCAL VOC 2012 validation set, we outperform the state-of-the-art (with 85.1% mean IoU). On the PASCAL VOC 2012 test set, our performance (88.0% mean IoU) is second only to the recent result of Chen et al. [25], who use a backbone network trained on an internal JFT dataset [26,27,28], whereas our backbone network is trained on the ImageNet dataset [10].

2 Related Work

Fully convolutional networks (FCNs) [7] have proved effective for semantic image segmentation by leveraging the powerful convolutional features of classification networks [27, 29, 30] pre-trained on large-scale data [10, 24]. The feature maps extracted by the different convolutional layers have progressively coarser spatial resolutions, and their neurons correspond to progressively larger receptive fields in the image space. Thus, the collection of feature maps of different resolutions encodes multi-scale context information. Semantic segmentation methods attempt to exploit this multi-scale context information for accurate segmentation. In this paper, we focus on two aspects, namely Feature Combination and Feature Connection, which most recent works [7, 9, 18,19,20, 31,32,33] have also explored to make better use of the image context.

Feature Combination. To capture multi-scale context information in the segmentation features, many works combine feature maps whose neurons have different receptive fields, and various combination schemes have been proposed. Spatial pyramid pooling (SPP) [34] has been successfully applied for combining different convolutional feature maps [9, 18, 20]. Generally, the last convolutional feature map, which is fed to the pixel-wise classifier, is equipped with an SPP (see Fig. 1(a)). However, the SPP-enriched feature maps lack the fine detail that is lost by the down-sampling operations of an FCN. Although atrous convolution can preserve the resolution of feature maps and thus retain more detail, it requires a large amount of GPU memory [27, 29, 30]. To save GPU memory and improve segmentation performance, some networks [17, 19, 21, 35] utilize an Encoder-Decoder (ED) network to gradually combine adjacent feature maps along the top-down hierarchy of a common FCN architecture, propagating semantic information from the low-resolution feature maps to the high-resolution ones and using the high-resolution feature maps to recover the details of objects (see Fig. 1(b)). The latest work [25] further uses the ED network along with atrous spatial pyramid pooling (ASPP) [20], and combines multi-resolution feature maps for information enrichment. In the ED network, each feature map of the decoder part only directly receives information from the feature map at the same level of the encoder part. The strongly-correlated semantic information provided by the adjacent lower-resolution feature map of the encoder part, however, has to pass through additional intermediate layers to reach the same decoder layer, which may result in information decay.

In contrast, our approach directly combines pairs of adjacent feature maps in the deep network hierarchy. It creates new feature maps that directly receive the semantic information and context from a lower-resolution feature map and the improved spatial detail from a higher-resolution feature map. In addition, in our architecture the information exchange between feature maps is recurrent and bidirectional, enabling better feature learning. The pairwise bidirectional connections produce a second, horizontal hierarchy of the resulting feature maps, leading up to a full resolution context-enriched feature map (rightmost feature map in Fig. 1(c)), which is used for pixel-wise label prediction.

Feature Connection. Connections between feature maps enable communication between neurons with different receptive field sizes, yielding new feature maps that encode multi-scale context information. FCN-based models [7,8,9, 17,18,19,20, 31] generally use separate neurons to represent regular regions of an image. Normally, they use convolutional/pooling kernels with predefined shapes to aggregate the information of adjacent neurons, and propagate this information to the neurons of other feature maps. However, traditional convolutional/pooling kernels only capture context information at a local scale. To leverage richer context information, graphical models have been integrated with FCNs [12, 13, 16]. Graphical models build dense connections between feature maps, allowing neurons to be more sensitive to the global image content that is critical for learning good segmentation features. Note that previous works use one-way connections that extract context information from the feature maps separately, before eventually combining it. Thus, the features learned at a given scale are not given the opportunity to optimally account for the multi-scale context information from all of the other scales.

In contrast to previous methods, our bidirectional connections exchange multi-scale context information to improve the learning of all features. We employ super-pixels computed from the image structure, and use the relationships between them to define the exchange routes between neurons in different feature maps. This enables more adaptive context information propagation. Several previous works [31,32,33, 36] also use super-pixels to define feature connections, and information exchange has been studied for object detection [37, 38]. However, these works do not exchange information between feature maps of different resolutions, which is critical for semantic segmentation.

3 Multi-scale Context Intertwining

To utilize multi-scale context information, most existing networks use one-way connections to combine feature maps of different resolutions, following the top-down order of the network hierarchy (see Fig. 1(a) and (b)). Here, we present a multi-scale context intertwining (MSCI) architecture, where the context information is propagated along two dimensions. The first dimension is along the vertical deep hierarchy (see Fig. 1(c)): our context intertwining scheme uses connections to exchange multi-scale context information between adjacent feature maps. Each connection is bidirectional, consisting of two different long short-term memory (LSTM) chains [22] that intertwine feature maps of different resolutions over a sequence of stages. By training the LSTM units, the bidirectional connections learn to produce more powerful feature maps. The second dimension is along the horizontal hierarchy: the feature maps produced by our bidirectional connections are fed to the next phase of context intertwining, which encodes the context information memorized by the bidirectional connections into the new feature maps.

Fig. 2. Multi-scale context intertwining between two successive feature maps in the deep hierarchy. The green arrows propagate the context information from the lower-resolution feature map to the higher-resolution one. Conversely, the blue arrows forward information from the higher-resolution feature map to augment the lower-resolution one. The orange circle in each stage indicates the hidden features output by LSTMs, including the cell states and gates. (Color figure online)

The overall MSCI architecture is illustrated in Fig. 1(c). Initially, we use the backbone FCN to compute a set \(\{F^l\}\) of convolutional feature maps of different resolutions, where \(l=1,...,L\) and \(F^{1}\) has the highest resolution. Figure 2 provides a more detailed view of the context intertwining between two successive feature maps \(F^l\) and \(F^{l+1}\). To exchange context information between \(F^l\) and \(F^{l+1}\), we construct a bidirectional connection \(\mathcal {L}\):

$$\begin{aligned} \{Q^{l}, C^{l \rightarrow l+1}_T, C^{l+1 \rightarrow l}_T\} = \mathcal {L}(F^l, F^{l+1}, C^{l \rightarrow l+1}, C^{l+1 \rightarrow l}, P^{l \rightarrow l+1}, P^{l+1 \rightarrow l}, T). \end{aligned}$$
(1)

The bidirectional connection \(\mathcal {L}\) consists of two different LSTM chains. One chain has the parameter set \(P^{l \rightarrow l+1}\); it extracts context information from \(F^l\) and passes it to \(F^{l+1}\). The other chain has the parameter set \(P^{l+1 \rightarrow l}\) and passes context information from \(F^{l+1}\) to \(F^l\). \(C^{l \rightarrow l+1}\) and \(C^{l+1 \rightarrow l}\) are the cell states of the two LSTMs, and they are initialized to zero at the beginning. As shown in Fig. 2, the information exchange takes place over T stages. At each stage t, information is exchanged between the feature maps \(F^l_t\) and \(F^{l+1}_t\), yielding the maps \(F^l_{t+1}\) and \(F^{l+1}_{t+1}\). Note that the resulting feature map \(F^{l}_T\) has a higher resolution than \(F^{l+1}_T\). Thus, we deconvolve the feature map \(F^{l+1}_T\) with the kernel \(D^{l+1}_{f}\) and add it to \(F^{l}_T\) to obtain a combined high-resolution feature map \(Q^{l}\):

$$\begin{aligned} Q^{l} = F^{l}_T + D^{l+1}_{f} *F^{l+1}_T. \end{aligned}$$
(2)

Note that the feature map \(Q^{l}\) and the cell states \(C^{l \rightarrow l+1}_T\) and \(C^{l+1 \rightarrow l}_T\) can be further employed to drive the next phase of context intertwining (the next level of the horizontal hierarchy). Along the LSTM chains, the feature maps contain neurons with increasingly large receptive fields, i.e., with richer global context. Moreover, the cell states of the LSTMs memorize the context information exchanged at the different stages. Due to the shortcut design of the cell states [22], the local context from the early stages can easily propagate to the last stage, encoding multi-scale context, including both local and global information, into the final feature map.
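As a concrete illustration, the following is a minimal NumPy sketch of one such intertwining phase, i.e., of Eqs. (1) and (2). The `exchange_step` callable (one bidirectional LSTM stage, detailed in Sect. 4) and the nearest-neighbour `upsample` (a crude stand-in for the learned deconvolution kernel \(D^{l+1}_f\)) are simplifying assumptions of this sketch, not the paper's exact implementation.

```python
import numpy as np

def upsample(x, factor=2):
    """Nearest-neighbour upsampling; a stand-in for the learned
    deconvolution with kernel D^{l+1}_f in Eq. (2)."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def bidirectional_connection(F_hi, F_lo, exchange_step, T=3):
    """One context-intertwining phase, Eqs. (1)-(2) (sketch).

    F_hi: higher-resolution feature map F^l,     shape (H, W, C)
    F_lo: lower-resolution  feature map F^{l+1}, shape (H/2, W/2, C)
    exchange_step: callable performing one bidirectional LSTM stage
                   (Eq. (5) in both directions); returns updated maps
                   and cell states.
    """
    C_up = np.zeros_like(F_lo)    # cell state C^{l -> l+1}
    C_down = np.zeros_like(F_hi)  # cell state C^{l+1 -> l}
    for _ in range(T):            # T exchange stages (Fig. 2)
        F_hi, F_lo, C_down, C_up = exchange_step(F_hi, F_lo, C_down, C_up)
    # Eq. (2): lift F^{l+1}_T to the higher resolution and add it to F^l_T.
    Q = F_hi + upsample(F_lo)
    return Q, C_up, C_down
```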

Algorithm 1.

The entire MSCI process is summarized in Algorithm 1. We assume the MSCI process has K phases in total, each of which produces new feature maps. As each pair of feature maps is intertwined, the corresponding cell states \((C^{l \rightarrow l+1}, C^{l+1 \rightarrow l})\) are iteratively updated to provide the memorized context that assists the information exchange in the next phase. Finally, the output is the high-resolution feature map \(Q^{1}_K\), which is fed to the pixel-wise classifier for segmentation. Algorithm 1 describes the feed-forward pass through the LSTMs. We remark that the LSTM parameters are reusable, and the LSTMs are trained using the standard stochastic gradient descent (SGD) algorithm with back-propagation. Below, we focus on a single context intertwining phase, and thus omit the subscript k to simplify notation.
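Since Algorithm 1 itself is not reproduced here, the following Python sketch reconstructs the feed-forward pass it summarizes, under the assumption that each phase intertwines every adjacent pair of the current feature maps and so reduces their number by one, as in Fig. 1(c). The `bidirectional_connection` argument is a phase function like the one sketched after Eq. (2), with its `exchange_step` already bound (e.g. via `functools.partial`); for brevity this sketch drops the cell states between phases, whereas Algorithm 1 carries them forward.

```python
def msci_forward(features, bidirectional_connection, T=3):
    """Reconstructed feed-forward MSCI pass (cf. Algorithm 1).

    features: [F^1_0, ..., F^L_0], ordered from highest to lowest
              resolution (e.g. taken from res2..res5 of the backbone).
    Returns the single high-resolution, context-enriched map Q^1_K
    that is fed to the pixel-wise classifier.
    """
    maps = list(features)
    while len(maps) > 1:                    # K = L - 1 phases in total
        next_maps = []
        for l in range(len(maps) - 1):      # intertwine adjacent pairs
            Q, C_up, C_down = bidirectional_connection(
                F_hi=maps[l], F_lo=maps[l + 1], T=T)
            next_maps.append(Q)             # cell states dropped here;
        maps = next_maps                    # Algorithm 1 reuses them
    return maps[0]                          # Q^1_K
```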

4 Bidirectional Connection

In this section, we describe in more detail the bidirectional connections that enable mutual exchange of context information between low- and high-resolution feature maps. Our bidirectional connections are guided by the super-pixel structure of the original image, as illustrated in Fig. 3. Given an input image I, we divide it into non-overlapping super-pixels, which correspond to a set of regions \(\{S_n\}\). Let \(F^{l}_t\) and \(F^{l+1}_t\) denote two feature maps of adjacent resolutions in our network, where l is the resolution level and t is the LSTM stage. The context information exchange between \(F^{l}_t\) and \(F^{l+1}_t\) is conducted over the regions defined by the super-pixels. Informally, at each of the two levels, for each region \(S_n\) we first aggregate the neurons whose receptive fields are centered inside \(S_n\). Next, we sum together the aggregated features of \(S_n\) and all of its neighboring regions at one level, and pass the resulting context information to the neurons of the other level that reside in region \(S_n\). This is done in both directions, as shown in Fig. 3(a) and (b). Thus, we enrich the locally aggregated context information of each neuron with that of its counterpart at the other level, as well as with the more global context aggregated from the surrounding regions. Our results show that this significantly improves segmentation accuracy.

Fig. 3. Bidirectional context aggregation. The features are partitioned into regions defined by super-pixels. We aggregate the neurons residing in the same region, and pass the information of the adjacent regions along the bidirectional connection (a) from a low-resolution feature map to a high-resolution one; and (b) from a high-resolution feature map to a low-resolution one.

Formally, given the feature map \(F^{l}_t\) and a region \(S_n\), we first aggregate the neurons in \(S_n\), yielding a regional context feature \(R^{l}_{n, t} \in \mathbb {R}^{C}\):

$$\begin{aligned} R^{l}_{n, t} = \sum _{(h, w) \in \varPhi (S_n)} F^{l}_{t}(h, w), \end{aligned}$$
(3)

where \(\varPhi (S_n)\) denotes the set of centers of the receptive fields inside the region \(S_n\). Next, we define a more global context feature \(M^{l}_{n, t}\), by aggregating the regional features of \(S_n\) and of its adjacent regions \(\mathcal {N}(S_n)\):

$$\begin{aligned} M^{l}_{n, t} = \sum _{S_m \in \mathcal {N}(S_n)} R^{l}_{m, t}. \end{aligned}$$
(4)
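To make Eqs. (3) and (4) concrete, the sketch below aggregates a feature map over super-pixel regions with a scatter-add, and then sums each region's feature with those of its neighbours. The super-pixel label map and the pixel-grid adjacency test are bookkeeping assumptions of this sketch; the paper defines \(\mathcal {N}(S_n)\) on the super-pixel graph but does not spell out the data structures.

```python
import numpy as np

def regional_context(F, labels):
    """Eq. (3): R[n] = sum of the neurons whose receptive-field centres
    fall inside super-pixel S_n.

    F:      feature map, shape (H, W, C)
    labels: super-pixel index of each neuron, shape (H, W), values in [0, N)
    """
    C = F.shape[-1]
    N = int(labels.max()) + 1
    R = np.zeros((N, C), dtype=F.dtype)
    np.add.at(R, labels.reshape(-1), F.reshape(-1, C))   # scatter-add
    return R

def neighbourhood_context(R, labels):
    """Eq. (4): M[n] = sum of R[m] over the regions S_m adjacent to S_n.
    (The text of Sect. 4 suggests S_n itself may also be included;
    that would amount to adding R to the result.)"""
    N = R.shape[0]
    adj = np.zeros((N, N), dtype=bool)
    a, b = labels[:, :-1], labels[:, 1:]                 # horizontal pairs
    adj[a[a != b], b[a != b]] = True
    a, b = labels[:-1, :], labels[1:, :]                 # vertical pairs
    adj[a[a != b], b[a != b]] = True
    adj |= adj.T                                         # symmetric adjacency
    return adj.astype(R.dtype) @ R
```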

The above features are propagated bidirectionally between \(F^{l}_t\) and \(F^{l+1}_t\) using a pair of LSTM chains, as illustrated in Fig. 2. In the \(t^{th}\) stage, an LSTM unit generates a new feature \(F^{l+1}_{t+1}\) from \(F^{l+1}_{t}\), \(R^{l}_{n, t}\), and \(M^{l}_{n, t}\), as follows:

$$\begin{aligned}&G^{l \rightarrow l+1}_{i, t}(h, w) = \sigma (W^{l+1}_{i} *F^{l+1}_t(h, w) + W^{l}_{s, i} *R^{l}_{n, t} + W^{l}_{a, i} *M^{l}_{n, t}+ b^{l+1}_i), \nonumber \\&G^{l \rightarrow l+1}_{f, t}(h, w) = \sigma (W^{l+1}_{f} *F^{l+1}_t(h, w) + W^{l}_{s, f} *R^{l}_{n, t} + W^{l}_{a, f} *M^{l}_{n, t}+ b^{l+1}_f), \nonumber \\&G^{l \rightarrow l+1}_{o, t}(h, w) = \sigma (W^{l+1}_{o} *F^{l+1}_t(h, w) + W^{l}_{s, o} *R^{l}_{n, t} + W^{l}_{a, o} *M^{l}_{n, t}+ b^{l+1}_o), \nonumber \\&G^{l \rightarrow l+1}_{c, t}(h, w) = \tanh (W^{l+1}_{c} *F^{l+1}_t(h, w) + W^{l}_{s, c} *R^{l}_{n, t} + W^{l}_{a, c} *M^{l}_{n, t}+ b^{l+1}_c), \nonumber \\&C^{l \rightarrow l+1}_{t+1}(h, w) = G^{l \rightarrow l+1}_{f, t}(h, w) \odot C^{l \rightarrow l+1}_{t}(h, w) + G^{l \rightarrow l+1}_{i, t}(h, w) \odot G^{l \rightarrow l+1}_{c, t}(h, w), \nonumber \\&~~~~~~~~A^{l \rightarrow l+1}_{t+1}(h, w) = \tanh (G^{l \rightarrow l+1}_{o, t}(h, w) \odot C^{l \rightarrow l+1}_{t+1}(h, w)), \nonumber \\&~~~~~~~~F^{l+1}_{t+1}(h, w) = F^{l+1}_t(h, w) + A^{l \rightarrow l+1}_{t+1}(h, w), \end{aligned}$$
(5)

where \((h, w) \in \varPhi (S_n)\), and W and b denote convolutional kernels and biases. In Eq. (5), convolutions are denoted by \(*\), while \(\odot \) denotes the Hadamard product. G and C represent the gates and cell state of an LSTM unit, respectively. \(A^{l \rightarrow l+1}_{t+1}\) is the augmentation feature for \(F^{l+1}_t\), and the two have equal resolution. We add the augmentation feature \(A^{l \rightarrow l+1}_{t+1}\) to \(F^{l+1}_t\), producing the new feature \(F^{l+1}_{t+1}\) for the next stage. The sequence of features \(F^{l}_t\) is defined in the same way as above (with the l superscripts replaced by \(l+1\), and vice versa).
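For reference, a minimal NumPy rendering of one direction of Eq. (5) is given below. To keep the sketch short, the learned convolutional kernels are reduced to \(1 \times 1\) convolutions (plain per-neuron matrix multiplications), and the parameter layout in `params` is an assumption of this sketch rather than the paper's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_exchange_step(F, R, M, C_cell, labels, params):
    """One direction of Eq. (5): update the target map F^{l+1}_t from the
    regional (R) and neighbourhood (M) context of the source level l.

    F:      target feature map F^{l+1}_t,      shape (H, W, C)
    R, M:   per-region context features,       shape (N, C) each
    C_cell: cell state C^{l -> l+1}_t,         shape (H, W, C)
    labels: super-pixel index of each neuron,  shape (H, W)
    params: per-gate dicts with 'W_feat', 'W_reg', 'W_nbr' of shape (C, C)
            and 'b' of shape (C,); kernels reduced to 1x1 convolutions here.
    """
    Rn, Mn = R[labels], M[labels]             # broadcast R^l_{n,t}, M^l_{n,t}
    def gate(name, act):
        p = params[name]
        return act(F @ p['W_feat'] + Rn @ p['W_reg'] + Mn @ p['W_nbr'] + p['b'])
    G_i = gate('input',  sigmoid)
    G_f = gate('forget', sigmoid)
    G_o = gate('output', sigmoid)
    G_c = gate('cell',   np.tanh)
    C_next = G_f * C_cell + G_i * G_c         # new cell state C^{l->l+1}_{t+1}
    A = np.tanh(G_o * C_next)                 # augmentation feature A_{t+1}
    return F + A, C_next                      # F^{l+1}_{t+1}, updated cell
```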

5 Implementation Details

We use the Caffe platform [39] to implement our approach. Our approach can be based on different deep architectures [29, 30, 40]; we use the ResNet-152 architecture [30] pre-trained on the ImageNet dataset [10] as our backbone network. We randomly initialize the parameters of our LSTM-based bidirectional connections. Before training our network for the evaluations on the different benchmarks, we follow [17, 18, 20, 25] and use the COCO dataset [2] to fine-tune the whole network.

Given an input image, we apply the structured edge detection toolbox [41] to compute super-pixels. Empirically, we set the scale of super-pixels to be 1,000 per image. The image is fed to the backbone network to compute the convolutional features. Following [31], we select the last convolutional feature map from each residual block as the initial feature maps fed into our context intertwining network. More specifically, we use the ResNet-152 network layers res2, res3, res4 and res5 as \(\left\{ F^1_0, F^2_0, F^3_0, F^4_0 \right\} \), respectively. Successive pairs of these feature maps are fed into our LSTM-based context intertwining modules, each of which has 3 bidirectional exchange stages. We optimize the segmentation network using the standard SGD solver. We fine-tune the parameters of the backbone network and the bidirectional connections.

During training, we augment the training data with common flipping, cropping, scaling, and rotation transformations. The network is fine-tuned with a learning rate of 1e\(-3\) for 60K mini-batches. After that, we decay the learning rate to 1e\(-4\) for the next 60K mini-batches. The size of each mini-batch is set to 12. With the trained model, we perform multi-scale testing on each image to obtain the segmentation result. That is, we rescale each testing image using five factors (i.e., \(\{0.4, 0.6, 0.8, 1.0, 1.2\}\)) and feed the differently scaled versions into the network to obtain predictions. The predictions are averaged to yield the final result.
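As a usage illustration of the multi-scale testing described above, here is a hedged sketch. `predict_fn` stands for the trained network (in whatever framework it lives), and `scipy.ndimage.zoom` is used merely as a convenient resampler; neither is specified by the paper.

```python
import numpy as np
from scipy.ndimage import zoom

def multi_scale_predict(image, predict_fn, scales=(0.4, 0.6, 0.8, 1.0, 1.2)):
    """Multi-scale testing: run the network on rescaled copies of the
    image, resize the class-score maps back to the original resolution,
    and average them before taking the per-pixel argmax.

    image:      float array of shape (H, W, 3)
    predict_fn: callable mapping an image to per-pixel class scores
                of shape (h, w, num_classes).
    """
    H, W = image.shape[:2]
    accum = None
    for s in scales:
        scaled = zoom(image, (s, s, 1), order=1)          # rescale input
        scores = predict_fn(scaled)
        scores = zoom(scores, (H / scores.shape[0],       # back to (H, W)
                               W / scores.shape[1], 1), order=1)
        accum = scores if accum is None else accum + scores
    return np.argmax(accum / len(scales), axis=-1)        # per-pixel labels
```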

6 Experiments

We evaluate our approach on four public benchmarks for semantic segmentation: the PASCAL VOC 2012 [1], PASCAL-Context [3], NYUDv2 [23], and SUN-RGBD [24] datasets. The PASCAL VOC 2012 dataset [1] has been widely used for evaluating segmentation performance. It contains 10,582 training images along with pixel-wise annotations for 20 object classes and the background. The PASCAL VOC 2012 dataset also provides a validation set of 1,449 images and a test set of 1,456 images. We use this dataset for the main evaluation of our network. We further use the PASCAL-Context, NYUDv2, and SUN-RGBD datasets for extensive comparisons with state-of-the-art methods. We report all segmentation scores in terms of mean Intersection-over-Union (IoU).

Ablation Study of MSCI. Our MSCI architecture is designed to enable exchange of multi-scale context information between feature maps. It consists of recurrent bidirectional connections defined using super-pixels. Below, we report an ablation study of our approach, which examines the effect that removing various key components has on segmentation performance. The results are summarized in Table 1.

Our approach is based on LSTMs, each of which can be regarded as a special recurrent neural network (RNN) unit with a cell state for memorization. By removing the RNNs and the cell states, we effectively disable the bidirectional connections between feature maps. In this case, our model degrades to a basic FCN and obtains a segmentation score of 77.8%, which lags far behind our full MSCI model.

Table 1. Ablation experiments on the PASCAL VOC 2012 validation set. Segmentation accuracy is reported in terms of mean IoU (%).
Table 2. Comparison of different feature combination strategies. Performance is evaluated on the PASCAL VOC 2012 and PASCAL-Context validation sets. Segmentation accuracy is reported in terms of mean IoU (%).

Next, we investigate the importance of the cell states. The cell states are employed by our approach to memorize the local and global context information, which enriches the final segmentation feature map. With all the cell states removed from our bidirectional connections, our approach achieves an accuracy of \(84.4\%\), which is significantly lower than the \(85.1\%\) accuracy of our full approach.

In our approach, the super-pixels adaptively partition the features into different regions according to the image structure, which are then used for context aggregation and exchange (Fig. 3). We remove the super-pixels and instead interpolate the low-resolution feature maps [17, 18] to match the high-resolution maps, so that each neuron aggregates context from a local regular window. Compared to our full model, the performance drops to \(84.3\%\), demonstrating the effectiveness of using super-pixels to guide context aggregation.

Feature Combination Strategies. Our approach combines the features produced by the bidirectional connections in a hierarchical manner. In Table 2, we compare our feature combination strategy to those of other networks [9, 17, 18, 20, 25]. For a fair comparison, we reproduce the compared networks by pre-training them with the ResNet-152 backbone model on the ImageNet dataset, and fine-tuning them on the COCO dataset and the PASCAL VOC 2012 training set. Without any combination of features, the backbone FCN model achieves a score of \(77.8\%\). Next, we compare our network to the SPP [9, 18, 20] and Encoder-Decoder [17, 19, 21, 25, 35] networks. For the SPP network, we choose the state-of-the-art model proposed in [18] for comparison. The ASPP network [20] is a variant of the SPP network and achieves better results than the SPP network. For the Encoder-Decoder network, we select the model proposed in [17] for comparison. We also compare our network with the latest Encoder-Decoder network with ASPP components [25]. When these models combine adjacent features that have been learned with our bidirectional connections, their segmentation scores generally improve by 0.4–1.2 points over their counterparts without bidirectional connections. Overall, our approach performs better than the other methods. In Fig. 4, we can also observe that MSCI provides better visual results than the other methods.

Table 3. Comparisons with other state-of-the-art methods. Performance is evaluated on the PASCAL VOC 2012 validation set (left) and test set (right). Segmentation accuracy is reported in terms of mean IoU (%).
Fig. 4. The segmentation results of the ASPP model [20], the Encoder-Decoder with ASPP model [25], and our MSCI. The images are taken from the PASCAL VOC 2012 validation set.

Fig. 5. The segmentation results of the ASPP model [20], the Encoder-Decoder with ASPP model [25], and our MSCI. The images are scenes taken from the PASCAL-Context validation set.

Comparisons with State-of-the-Art Methods. In Table 3, we report the results of our approach on the PASCAL VOC 2012 validation and test sets, and compare with state-of-the-art methods. On the validation set (see Table 3(left)), MSCI achieves a better result than all other methods. Specifically, given the same set of training images, it outperforms the models proposed in [17, 18, 20], which are based on the SPP, ASPP and Encoder-Decoder networks, respectively. In addition, we also report our result on the test set; our per-category results on the test set can be found on the PASCAL VOC leaderboard. Our result of \(88.0\%\) is second only to the score reported in [25], which leverages a stronger backbone network, trained on an internal JFT-300M dataset [26,27,28].

Experiments on Scene Labeling Datasets. We perform additional experiments on three scene labeling datasets: PASCAL-Context [3], NYUDv2 [23], and SUN-RGBD [24]. In contrast to the object-centric PASCAL VOC 2012 dataset, these scene labeling datasets provide more complex pixel-wise annotations of objects and stuff, which require segmentation networks to reason fully about the scene in an image. We use these datasets to verify whether our network can label scene images well.

The PASCAL-Context dataset [3] contains 59 categories and background, providing 4,998 images for training and 5,105 images for validation. In Table 2, we already used this dataset to compare MSCI to other feature combination strategies, and found that it works well on the scene labeling task. We provide several segmentation results in Fig. 5. Table 4 shows that MSCI outperforms other state-of-the-art methods on this dataset.

Table 4. Comparison with other state-of-the-art methods. Performance is evaluated on the PASCAL-Context validation set (left), NYUDv2 validation set (middle) and the SUN-RGBD validation set (right). Segmentation accuracy is reported in terms of mean IoU (%).

We further evaluate our method on the NYUDv2 [23] and SUN-RGBD [24] datasets, originally intended for RGB-D scene labeling. The NYUDv2 dataset [23] has 1,449 images (795 training images and 654 testing images) with pixel-wise annotations of 40 categories. The SUN-RGBD dataset [24] has 10,335 images (5,285 training images and 5,050 testing images) with pixel-wise annotations of 37 categories. Unlike the PASCAL-Context dataset, the NYUDv2 and SUN-RGBD datasets consist of images of indoor scenes. We report the segmentation scores of MSCI and other state-of-the-art methods in Table 4. We note that the best previous method [55] uses RGB and depth information jointly for segmentation, and achieves scores of 47.7% and 48.1% on the NYUDv2 and SUN-RGBD validation sets, respectively. Even without the depth information, MSCI outperforms these previous best results. We show some of our segmentation results on the NYUDv2 and SUN-RGBD validation sets in Fig. 6.

Fig. 6. MSCI segmentation results. The images are taken from the NYUDv2 validation set (left) and the SUN-RGBD validation set (right).

7 Conclusions

Recent progress in semantic segmentation may be attributed to powerful deep convolutional features and the joint consideration of local and global context information. In this work, we have proposed a novel approach for connecting and combining feature maps and context from multiple scales. Our approach uses interconnected LSTM chains in order to effectively exchange information among feature maps corresponding to adjacent scales. The enriched maps are hierarchically combined to produce a high-resolution feature map for pixel-level semantic inference. We have demonstrated that our approach is effective and outperforms the state-of-the-art on several public benchmarks.

In the future, we plan to apply our MSCI approach to stronger backbone networks and more large-scale datasets for training. In addition, we aim to extend MSCI to other recognition tasks, such as object detection and 3D scene understanding.