1 Introduction

Depth acquisition has been actively studied over the past decades, with widespread applications in 3D modeling, scene understanding, depth-aware image synthesis, etc. However, traditional hardware- or software-based approaches are restricted either by the environment or by the assumption of multi-view observations. To overcome these limitations, there is growing interest in predicting depth from a single image.

Monocular depth prediction is an ill-posed and inherently ambiguous problem. However, humans can perceive depth from a single image quite well, given that sufficient samples (e.g. the appearances of nearby/distant objects) have been learned over a lifetime. With the success of deep learning techniques and the availability of training data, the performance of monocular depth estimation has been greatly improved [5, 53]. However, existing methods measure depth estimation accuracy with vanilla loss functions (e.g. \(\ell _1\) or \(\ell _2\)), assuming that all regions in the scene contribute equally and ignoring the statistics of the depth data. We have empirically found that depth values in indoor/outdoor scenes vary greatly across regions and exhibit a long-tail distribution (see Fig. 1). This is an inherent property of natural scenes, mainly caused by the perspective effect during the depth acquisition process. Given such imbalanced data, loss functions that treat all regions equally are dominated by samples with small depth, rendering the models “short-sighted” and ineffective at predicting the depth of distant regions.

Moreover, complementary to learned prior knowledge such as perspective, semantic understanding of the scene (e.g. the sky is far away, walls are vertical) essentially benefits depth estimation. For example, knowing whether a cylinder-like object is a pencil or a pole can help estimate its depth. Conversely, depth information is also helpful for differentiating semantic labels, especially for different objects with similar appearances [4, 11, 41]. Estimating depth and semantics can thus be mutually beneficial. Unfortunately, there is a lack of strategies for efficiently propagating and sharing information across the two tasks.

In this work, we address the above two challenges by presenting a deep network that predicts depth as well as semantic labels from a single still image. A novel attention-driven loss with a depth-aware objective is proposed to supervise the network training, which alleviates the data bias issue and guides the model to look deeper into the scene. In addition, within our synergy network architecture, we propose an information propagation strategy that operates in a dynamic routing fashion to better incorporate semantics into depth estimation. The strategy is realized by a lateral sharing unit and a semi-dense up-skip connection, which allow information to propagate through internal representations across and within both tasks. Experimental results on a challenging indoor dataset show that, with the proposed loss and knowledge sharing strategy, the performance of monocular depth estimation is significantly improved, reaching state-of-the-art levels. Our contributions are summarized as follows:

  • We propose a novel attention-driven loss to better supervise the network training on existing datasets with long-tailed distributions. It helps improve depth prediction performance, especially for distant regions.

  • We present a synergy network architecture that better propagates semantic information to depth prediction, via a proposed information propagation strategy for both inter- and intra-task knowledge sharing.

  • Extensive experiments demonstrate the effectiveness of our method with state-of-the-art performance on both depth and semantics prediction tasks.

2 Related Work

Depth from Single Image. Early works on monocular depth estimation mainly leverage hand-crafted features. Saxena et al. [44] predict monocular depth with a linear model on an over-segmented input image. Hoiem et al. [17] further group the superpixels into geometrically meaningful labels and construct a 3D model accordingly. Later on, with large-scale RGB-D data available, data-driven approaches [21, 22, 27, 28, 30, 35, 43] became feasible. Eigen et al. [4, 5] construct a multi-scale deep convolutional neural network (CNN) to produce dense depth maps. Some methods [24, 29, 34, 51,52,53, 56] aim to increase accuracy by incorporating Conditional Random Fields (CRFs); despite notable improvements, the model complexity increases as well. Other works [1, 57] predict depth by exploring ordinal relationships. Data imbalance is reported in [28, 43] but not explicitly addressed. Some other works [6, 9, 26, 55] supervise the network with a reconstruction loss from another stereo or temporal view. While requiring no depth supervision, these methods usually need rectification and alignment, and they rely on multi-view images during training. Although remarkable performance has been achieved, the long-tail property of depth data distributions has not yet been well explored.

Depth with Semantics. Since depth and semantic labels share context information, some methods [3, 4, 11, 42, 46] take the depth map as guidance to improve semantic segmentation performance. In [46], Silberman et al. propose the NYU RGBD dataset and combine RGB and depth to improve segmentation. Based on this dataset, several methods [3, 11] take RGB-D as input to perform semantic segmentation. Eigen and Fergus [4] design a deep CNN that takes RGB, depth, and surface normals as input to predict semantic labels. Owing to the power of CNN models, other methods [41, 49, 50] have recently been proposed to better leverage depth for semantic labeling. While great performance has been demonstrated, ground truth depth is indispensable for the labeling task. On the other hand, prior information encoded in semantic labels can be leveraged to assist depth prediction. Instead of directly mapping from color image to depth, Liu et al. [33] first perform semantic segmentation of the scene and then use the labels to guide depth prediction, in a sequential manner.

Joint Representation Sharing. Some recent works investigate representation sharing between different tasks [16, 19, 20, 27, 38, 39, 51]. Ladicky et al. [27] propose a semantic depth classifier and analyze perspective geometry for image manipulation, but rely on local hand-crafted features. In [12], a traditional framework is presented for joint segmentation and 3D reconstruction. Wang et al. [51] use a CNN followed by a hierarchical CRF to jointly predict semantic labels and depth. However, they only modify the last layer for prediction and rely on superpixels and a CRF. A concurrent work [23] proposes a weighting strategy for multi-task losses. Misra et al. [38] propose a cross-stitch (CS) network for multi-task learning. While it performs better than the baselines, it may suffer from propagation interruption if the combination weights degenerate to 0. The two-parallel-CNN design also increases the number of parameters and the learning complexity. Another sharing approach [18] applies dense connections between all layers of a CNN for recognition tasks. Such fully-dense connections share all the information but increase memory consumption as well.

In our work, we jointly train semantic labeling and depth estimation in an end-to-end fashion, without complicated pre- or post-processing. We also propose to capture better synergy representations between the two tasks. Furthermore, we investigate the long-tail data distribution in existing datasets and propose an attention-driven loss to better supervise the network training.

Fig. 1. Long-tailed distributions of depth and semantic labels. Vertical axes indicate the number of pixels. (a) shows the depth value (horizontal axis, in meters) distribution of the NYUD v2 dataset [46], and (b) shows the distribution of the KITTI dataset [7]. (c) gives the semantic label distribution (label index on the horizontal axis) of NYUD v2, while (d, e) are the distributions of the 40 [10] and 4 [46] categories mapped from the 800+ categories in (c). An imbalanced, long-tailed distribution can be observed in these datasets, even for semantic labels mapped to only four categories.

3 Depth-Aware Synergy Network

3.1 Depth-Aware Objective

Most state-of-the-art monocular depth estimation methods use CNNs to enable accurate depth prediction. In these frameworks, depth prediction is formulated as a regression problem, where an \(\ell _1\) or \(\ell _2\) loss is usually used to minimize the pixel-wise distance between the predicted and ground truth depth maps on the training data. We observe that a long-tail distribution resides in both indoor (NYUD v2 [46]) and outdoor (KITTI [7]) depth datasets. As shown in Fig. 1(a), (b), the number of samples/pixels per depth value falls dramatically after a particular depth, with only a small depth range accounting for a large number of pixels. This data imbalance problem shares similarities with that in object detection [32, 45] but differs in nature: the perspective effect inherent in the imaging process leads to the uneven distribution of depth pixels, which cannot be eliminated by simply increasing the training data. As a result, training deep models on such datasets with loss functions that treat all pixels equally, as in previous works, can be problematic. Easy samples with small depth values can easily overwhelm the training, while hard samples with large depth values contribute very little, biasing the models toward predicting smaller depth values.
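As a minimal illustration of this observation, the per-pixel depth histogram of a dataset can be inspected with a few lines of NumPy/Matplotlib; the function name and the random stand-in data below are hypothetical and not part of any released code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_depth_histogram(depth_maps, max_depth=10.0, bins=100):
    """depth_maps: iterable of HxW arrays of metric depth; invalid pixels are <= 0."""
    values = np.concatenate([d[d > 0].ravel() for d in depth_maps])
    plt.hist(values, bins=bins, range=(0.0, max_depth))
    plt.xlabel("depth (m)")
    plt.ylabel("number of pixels")
    plt.show()

# Example with random stand-in data (real maps would come from NYUD v2 / KITTI):
plot_depth_histogram([np.random.gamma(2.0, 1.5, size=(480, 640)) for _ in range(4)])
```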

Based on the above observations, we propose to guide the network to pay more attention to distant depth regions during training and to adaptively adjust the backpropagation flow accordingly. The proposed depth-aware objective is formulated as:

$$\begin{aligned} {L_{DA}} = \frac{1}{N}\sum \limits _{i = 1}^N {({\alpha _D} + {\lambda _D} ) \cdot \ell ({d_i},d_i^{GT})}, \end{aligned}$$
(1)

where i is the pixel index and N is the number of pixels in the depth map. \(d_i\) and \(d_i^{GT}\) are the predicted and ground truth depth values, respectively. \(\ell (\cdot )\) is a distance metric, which can be \(\ell _1\), \(\ell _2\), etc. \(\alpha _D\) is a depth-aware attention term that guides the network to focus more on distant, hard depth regions so as to reduce the data distribution bias. As a result, the gradients during backpropagation are weighted more toward the minority of distant regions relative to the vast nearby regions. Accordingly, \(\alpha _D\) should be positively correlated with depth and can be defined as a linear function of the ground truth depth.

To avoid vanishing gradients at the beginning of training and to avoid cutting off learning for nearby regions, a regularization term \(\lambda _D\) is introduced alongside the attention term:

$$\begin{aligned} {\lambda _D} = 1 - \frac{{\min (\log ({d_i}),\log (d_i^{GT}))}}{{\max (\log ({d_i}),\log (d_i^{GT}))}}, \end{aligned}$$
(2)

which describes the learning state during training. If the network at the current state predicts pixel i close to the ground truth, the regularization term \(\lambda _D\) approaches 0. When the network does not predict the value accurately, \(\lambda _D\) approaches 1. As a result, even for very near (\({\alpha _D} \rightarrow 0\)) regions that are not accurately predicted, gradients can still be backpropagated, approaching the original \(\ell \) loss function. In this way, Eq. 2 ensures stability during training. Our depth-aware objective guides the network to adaptively focus on different regions and automatically adjusts the strength/attention for each training sample, thus keeping the optimization direction of the model comparatively balanced. In sum, while \(L_{DA}\) preserves the focus on nearby pixel samples, it lets the network pay more attention to distant ones during training.
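A minimal PyTorch sketch of Eq. (1)-(2) is given below, assuming \(\alpha_D\) is the ground truth depth normalized by the maximum range (the choice made in Sect. 3.3) and using a plain \(\ell_1\) distance as the metric; treating \(\lambda_D\) as a detached weight is an implementation assumption of this sketch.

```python
import torch

def depth_aware_loss(pred, gt, max_depth=10.0, eps=1e-6):
    """Sketch of L_DA (Eq. 1) with the regularizer lambda_D (Eq. 2)."""
    pred = pred.clamp(min=eps)
    gt = gt.clamp(min=eps)
    alpha = gt / max_depth                                            # depth-aware attention term
    log_p, log_g = torch.log(pred), torch.log(gt)
    lam = 1.0 - torch.min(log_p, log_g) / torch.max(log_p, log_g)     # Eq. (2)
    return ((alpha + lam.detach()) * torch.abs(pred - gt)).mean()     # l(.) = l1 here
```

In practice the \(\ell_1\) distance above would be swapped for the reverse smooth \(L_1\) metric described in Sect. 3.3.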

Fig. 2. Overview of the proposed network architecture. A single RGB image is fed into the shared backbone encoder network (purple) and then decoupled into the depth prediction (grey) and semantic labeling (pink) sub-networks. Knowledge between the two sub-networks is shared through lateral sharing units (details in Fig. 3 left) for both inference and backpropagation, together with internal sharing by semi-dense up-skip connections (Fig. 3 right). The training is supervised by an attention-driven loss (Sect. 3.3). (Color figure online)

3.2 Network Architecture

The proposed synergy network is a multi-task deep CNN that mainly consists of four parts: the depth prediction sub-network, semantic labeling sub-network, knowledge sharing unit/connection, and the attention-driven loss. An overview architecture is shown in Fig. 2. The input RGB image is passed through a backbone encoder (e.g. VGG [47], ResNet [14]) to convert the color space into a high-dimension feature space. Following the backbone are two sub-networks reconstructing the depth and semantic labels from the shared high-dimension feature. Knowledge sharing between these two tasks is achieved by a Lateral Sharing Unit (LSU), which is proposed to automatically learn the propagation flow during the training process and results in an optimum structure at test time. Besides, knowledge sharing is also performed internally at each sub-network through the proposed semi-dense up-skip connections (SUC). Finally, the whole training process is supervised by an attention-driven loss which consists of the proposed depth-aware and other attention-based loss terms.

Fig. 3. Left: structure of the proposed lateral sharing unit at every two consecutive up-conv layers D1 and D2, with identity mappings (black links). Right: structure of the proposed semi-dense up-skip connections; dotted lines indicate up-skip connections, with the operator \(\hslash \) (bilinear up-sampling with convolution) shown on the right.

Fig. 4. Illustration of the effectiveness of the LSU. All depth maps are shown at the same scale.

Lateral Sharing Unit. We empirically explore different information sharing structures and find that different multi-task networks yield diverse performance, while the knowledge sharing strategy is hard to tune manually. In our synergy network, we propose a bi-directional Lateral Sharing Unit (LSU) to automatically learn the sharing strategy in a dynamic routing fashion. Information sharing is achieved for both the forward pass and backpropagation. Between every two up-conv layers in the network, we add an LSU to share residual knowledge/representations from the other task, in addition to the intra-task propagation. Different from hand-tuned structures, our LSU is able to acquire additional fractional sharing from inter- and intra-task layers. Specifically, the structure of the LSU is illustrated in Fig. 3 (left), which provides fully-sharing routes between the two tasks. Suppose the feature maps generated by the current up-conv layers are D1 and S1. Then the feature representations for sharing can be formed as,

$$\begin{aligned} \left\{ \begin{array}{l} LS{U_{D2}} = D1 + ({\varphi _D} \cdot D1 + {\varphi _S} \cdot S1)\\ LS{U_{S2}} = S1 + ({\gamma _D} \cdot D1 + {\gamma _S} \cdot S1) \end{array} \right. , \end{aligned}$$
(3)

where \(\varphi _D, \gamma _D\) are the weighting parameters for feature D1, and \(\varphi _S, \gamma _S\) for feature S1. The sharing representations \(LSU_{D2}\) and \(LSU_{S2}\) are propagated to the subsequent up-conv layers. Note that all parameters in the LSU are learned during training, resulting in a dynamic sharing route between every two up-conv layers. Although all LSUs share the same internal structure, their parameters are not tied, allowing for more flexible sharing. We further add identity mappings in addition to the combined sharing. With identity mappings, intra-task information propagation is ensured, avoiding the risk of “propagation interruption” or feature pollution. Such a residual-like structure (identity connection [15] associated with the residual sharing) also benefits efficient backpropagation of gradients. In addition, our LSU is applied between consecutive up-conv layers rather than in the encoding backbone, so far fewer combination and network parameters need to be learned. An example illustrating the effectiveness of our LSU is shown in Fig. 4. We can see that when the LSU is incorporated, semantics is propagated to the depth prediction, improving its accuracy (the top-right cabinet). Without the identity mapping, however, artifacts may also be introduced by the semantic propagation (bottom-right cabinet). With the identity mapping, fewer artifacts and higher accuracy are achieved (the fourth column).
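For concreteness, a PyTorch sketch of Eq. (3) is given below; the use of scalar combination weights, their zero initialization, and the assumption that D1/S1 have matching shapes are choices of this sketch (the paper only states that the parameters are learned).

```python
import torch
import torch.nn as nn

class LateralSharingUnit(nn.Module):
    """Sketch of the LSU in Eq. (3): identity mappings plus learned residual sharing."""
    def __init__(self):
        super().__init__()
        # Scalar combination weights, initialized to zero so that training starts
        # from pure identity propagation (an assumption of this sketch).
        self.phi_d = nn.Parameter(torch.zeros(1))
        self.phi_s = nn.Parameter(torch.zeros(1))
        self.gamma_d = nn.Parameter(torch.zeros(1))
        self.gamma_s = nn.Parameter(torch.zeros(1))

    def forward(self, d1, s1):
        d2 = d1 + (self.phi_d * d1 + self.phi_s * s1)      # LSU_D2
        s2 = s1 + (self.gamma_d * d1 + self.gamma_s * s1)  # LSU_S2
        return d2, s2
```

Because of the identity terms, setting all four weights to zero merely degenerates to independent branches rather than interrupting propagation, which mirrors the argument made above.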

Semi-dense Up-skip Connections. To perform better intra-task knowledge sharing and preserve long-term memory, we introduce Semi-dense Up-skip Connections (SUCs) between up-conv layers, as shown in Fig. 2 and detailed in Fig. 3 (right). Denote by \(f_{in}\) and \(f_{out}\) the input and output features of the decoder, and by \(f_i\) the output features of the i-th up-conv layer. In addition to the short-term memory from the preceding single up-conv layer, we add skip connections to propagate long-term memory. Our SUC is therefore formulated as,

$$\begin{aligned} f_{out}=\hslash (f_{in})+\sum _{i=1}^{n}\hslash (f_i), \end{aligned}$$
(4)

where n is the number of up-conv layers (\(n=4\) in our work), and \(\hslash \) denotes an up-resize operation that matches the feature size of the last up-conv layer. We also tried concatenating features, which performs slightly worse than summation. Our SUC operates in a semi-dense manner between adjacent up-conv layers, rather than in a fully-dense manner. In this way, memory consumption is greatly reduced without sacrificing performance, according to our experiments. In addition, with long- and short-term connections, the features from different up-conv steps are fused in a coarse-to-fine, multi-scale fashion, which incorporates both global and local information.
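A sketch of Eq. (4) is shown below; the 3x3 convolution inside \(\hslash\) and the channel widths are assumptions of this sketch, since the text only specifies bilinear up-sampling followed by a convolution.

```python
import torch.nn as nn
import torch.nn.functional as F

class SemiDenseUpSkip(nn.Module):
    """Sketch of Eq. (4): f_out = h(f_in) + sum_i h(f_i)."""
    def __init__(self, in_channels, out_channels, out_size):
        super().__init__()
        # One h operator per input feature (f_in, f_1, ..., f_n).
        self.ops = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=3, padding=1) for c in in_channels])
        self.out_size = out_size  # spatial size of the last up-conv layer

    def forward(self, features):
        out = 0.0
        for f, conv in zip(features, self.ops):
            f = F.interpolate(f, size=self.out_size, mode="bilinear", align_corners=False)
            out = out + conv(f)   # summation performed slightly better than concatenation
        return out
```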

3.3 Attention-Driven Loss

Depth-Aware Loss. As defined in Sect. 3.1, we use the depth-aware loss term (Eq. 1) to supervise the depth prediction task during training. Specifically, we set the attention term \(\alpha _D=d^{GTn}\), where \(d^{GTn}\) is the ground truth depth normalized over the whole depth range (the attention guidance in Fig. 2). The distance metric \(\ell \) is set to the reverse smooth \(L_1\)-norm [8, 28] due to its robustness.
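The reverse smooth \(L_1\) (berHu) metric of [28] can be sketched as follows; the per-batch threshold \(c = 0.2\max|x|\) follows the common choice in [28] and is an assumption of this sketch. The function returns a per-pixel loss so that it can be plugged into Eq. (1) before the attention weighting and averaging.

```python
import torch

def berhu(pred, gt):
    """Reverse smooth-L1 (berHu) distance used as l(.) in Eq. (1), per pixel."""
    x = torch.abs(pred - gt)
    c = (0.2 * x.max()).clamp(min=1e-6).detach()
    # L1 below the threshold, quadratic above it.
    return torch.where(x <= c, x, (x ** 2 + c ** 2) / (2.0 * c))
```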

Joint Gradient Loss. To better preserve details of local structures and surface regions, we impose constraints on gradients and introduce gradient loss layers whose kernels are set to the Sobel operator in both the horizontal (\(\nabla _h\)) and vertical (\(\nabla _v\)) directions,

$$\begin{aligned} {L_g}({d},d^{GT}) = \frac{1}{N}\sum \limits _{i = 1}^N {\left| {{\nabla _h}{d_i} - {\nabla _h}d_i^{GT}} \right| + \left| {{\nabla _v}{d_i} - {\nabla _v}d_i^{GT}} \right| }. \end{aligned}$$
(5)

In addition, the semantic information is taken into account through a joint gradient loss term, obtained by substituting the semantic segmentation result s for \(d^{GT}\): \({L_g}({d},{s})\). The joint gradient loss term is then formulated as \(L_{JG}={L_g}({d},d^{GT})+{L_g}({d},{s})\).
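A sketch of Eq. (5) with Sobel kernels is given below; how the segmentation result s is reduced to a single channel for the joint term \(L_g(d, s)\) is not specified in the text and is left to the caller here.

```python
import torch
import torch.nn.functional as F

def sobel_gradients(x):
    """Horizontal and vertical Sobel responses for a (B, 1, H, W) map."""
    kh = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]], device=x.device).view(1, 1, 3, 3)
    kv = kh.transpose(2, 3)
    return F.conv2d(x, kh, padding=1), F.conv2d(x, kv, padding=1)

def gradient_loss(d, target):
    """L_g of Eq. (5); L_JG = gradient_loss(d, d_gt) + gradient_loss(d, s)."""
    dh, dv = sobel_gradients(d)
    th, tv = sobel_gradients(target)
    return (torch.abs(dh - th) + torch.abs(dv - tv)).mean()
```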

Semantic Focal Loss. As shown in Fig. 1 (c–e), the category distribution is also long-tailed, even when mapped to a much smaller number (e.g. 40 or 4) of categories. Such an imbalanced distribution not only affects the semantic labeling task but also impacts depth prediction through the LSUs and backpropagation. Inspired by the focal loss [32] proposed for object detection, we guide the network to pay more attention to the hard, tailed categories and set the loss term as,

$$\begin{aligned} {L_{semF}}(l,{l^{GT}}) = - \frac{1}{N}\sum \limits _{i = 1}^N {\sum \limits _{k = 1}^K {l_{i,k}^{GT}{\alpha _k}{{(1 - {l_{i,k}})}^\gamma }\log ({l_{i,k}})} }, \end{aligned}$$
(6)

where \({l_i}\) is the label prediction at pixel i and k is the category index. \(\alpha _k\) and \(\gamma \) are the balancing weight and focusing parameter that modulate the loss attention.
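Eq. (6) can be sketched in PyTorch as below; the softmax over logits and the default \(\gamma = 2\) follow [32] and are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def semantic_focal_loss(logits, target, alpha, gamma=2.0, eps=1e-6):
    """Sketch of Eq. (6). logits: (B, K, H, W); target: (B, H, W) integer class
    indices; alpha: (K,) per-class balancing weights."""
    prob = F.softmax(logits, dim=1).clamp(min=eps)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    alpha = alpha.to(logits.device).view(1, -1, 1, 1)
    loss = -one_hot * alpha * (1.0 - prob) ** gamma * torch.log(prob)
    return loss.sum(dim=1).mean()   # sum over classes, average over pixels
```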

Fig. 5. Network attention visualization. Given an input RGB image, the spatial attention of the network is shown as an overlay on the input image.

The above loss terms/layers constitute the proposed attention-driven loss shown in Fig. 2, which is defined as,

$$\begin{aligned} {L_{attention}} = {L_{DA}} + {L_{JG}} + {L_{semF}}. \end{aligned}$$
(7)

3.4 Attention Visualization

To better illustrate the proposed attention-driven loss, we visualize the learned attention of the network, i.e. which regions the network focuses on most. Following [54], we use spatial attention maps to show the network attention. The attention map of the network for monocular depth estimation is shown in Fig. 5 (second column) as a heat-map, where red indicates high values. Note that the attention map here is different from the attention guidance in Fig. 2, although they share a similar high-level meaning. Here the attention map is represented by the aggregation of the feature activations from the first up-conv layer. In addition to depth estimation, the attention maps of the shared backbone and the semantic labeling branch are also presented in Fig. 5 for a thorough understanding of the network's attention distribution.

From the visualization we can see that the network mainly focuses on distant regions when performing monocular depth estimation. The shared backbone, on the other hand, focuses on a larger region around the distant area, indicating a more general attention over the whole scene that is still driven by distance. For the attention of semantic labeling, besides the dominant categories, some “tailed” categories also receive high attention, e.g. television, books, and bags. These attention visualization results provide a better understanding of the network's focus and validate the mechanism of the proposed attention-driven approach.
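A sketch of the activation-based spatial attention map used for this visualization is shown below, following [54]; aggregating by the sum of squared activations over channels and min-max normalizing for display are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def spatial_attention(feat, out_size=None):
    """Collapse a (B, C, H, W) feature map to a (B, 1, h, w) attention heat-map."""
    att = feat.pow(2).sum(dim=1, keepdim=True)               # channel-wise aggregation
    if out_size is not None:                                 # resize to the input image
        att = F.interpolate(att, size=out_size, mode="bilinear", align_corners=False)
    amin = att.amin(dim=(2, 3), keepdim=True)
    amax = att.amax(dim=(2, 3), keepdim=True)
    return (att - amin) / (amax - amin + 1e-6)               # normalize to [0, 1]
```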

4 Experiments

In this section, we evaluate the proposed approach on monocular depth estimation and compare it with state-of-the-art methods. Performance on semantic labeling is also presented to show the benefits of knowledge sharing.

4.1 Experimental Setup

Dataset and Evaluation Metrics. We use the NYU Depth v2 (NYUD2) dataset [46] for evaluation, which consists of 464 different indoor scenes with 894 different object categories (distributions shown in Fig. 1). We follow the standard train/test split with 795 aligned (RGB, depth) pairs for training and 654 pairs for testing, as adopted in [35, 53, 56]. In addition, each image in the standard split is manually annotated with semantic labels. In our experiments, we map the semantic labels into 4 and 40 categories, according to [46] and [10], respectively. We augment the training samples by random in-plane rotation (\([-5^ \circ , +5^ \circ ]\)), translation, horizontal flips, color shift (multiplying the RGB values by a factor \(\in {[0.8,1.2]^3}\)), and contrast shift (multiplying by a factor \(\in {[0.5,2.0]}\)).

We quantitatively evaluate monocular depth prediction using the following metrics: mean absolute relative error (rel), mean \(\log _{10}\) error (\(\log 10\)), root mean squared error (rms), rms(log), and the accuracy under threshold (\(\delta \!<\!1.25^i, i=1,2,3\)), following previous works [4, 9, 28, 51].
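For reference, the standard definitions of these metrics over valid pixels can be sketched as follows (NumPy, with a hypothetical validity mask of gt > 0).

```python
import numpy as np

def depth_metrics(pred, gt):
    """pred, gt: arrays of predicted / ground truth depth in meters."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    thresh = np.maximum(pred / gt, gt / pred)
    return {
        "rel": np.mean(np.abs(pred - gt) / gt),
        "log10": np.mean(np.abs(np.log10(pred) - np.log10(gt))),
        "rms": np.sqrt(np.mean((pred - gt) ** 2)),
        "rms_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "delta1": np.mean(thresh < 1.25),
        "delta2": np.mean(thresh < 1.25 ** 2),
        "delta3": np.mean(thresh < 1.25 ** 3),
    }
```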

Implementation Details. We implement our model on a single Nvidia Tesla K80 GPU using the PyTorch [40] framework. In our final model, a ResNet-50 [14] pre-trained on ImageNet, with the last classification layers removed, is taken as the shared backbone network. The structure of the decoder layers follows state-of-the-art designs [28, 53]. All other parameters in the depth decoder, semantic decoder, SUCs, and LSUs are randomly initialized by the strategy in [13] and trained from scratch. We train our model with a batch size of 12 using the Adam solver [25] with parameters \((\beta _1,\beta _2,\epsilon )=(0.9,0.999,10^{-8})\). \(\alpha \) and \(\gamma \) are set with reference to [32]. The images are first down-sampled to half size with invalid borders cropped, and finally up-sampled to the original size using techniques similar to previous works [4, 30, 35]. We first freeze the semantic branch together with all LSUs and train the rest of the model for depth prediction with a learning rate of \(10^{-3}\). We then freeze the depth branch and train the rest with a learning rate of \(10^{-5}\) on the backbone and \(10^{-3}\) on the semantic branch. Finally, the whole model is trained end-to-end with initial learning rates of \(10^{-4}\) for the backbone and \(10^{-2}\) for the rest. The learning rate is decreased by a factor of 10 every 20 epochs.
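The staged schedule above can be sketched as follows; the module names (backbone, depth_branch, semantic_branch, lsus) and the tiny stand-in model are hypothetical placeholders, not the actual architecture.

```python
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

class SynergyStandIn(nn.Module):
    """Hypothetical stand-in exposing the parameter groups named in the text."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 64, 3, padding=1)
        self.depth_branch = nn.Conv2d(64, 1, 3, padding=1)
        self.semantic_branch = nn.Conv2d(64, 4, 3, padding=1)
        self.lsus = nn.ModuleList([nn.Conv2d(64, 64, 1)])

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = SynergyStandIn()

# Stage 1: freeze the semantic branch and all LSUs, train the rest for depth (lr 1e-3).
set_trainable(model.semantic_branch, False)
set_trainable(model.lsus, False)
stage1 = Adam([p for p in model.parameters() if p.requires_grad],
              lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

# Stage 2: freeze the depth branch; lr 1e-5 on the backbone, 1e-3 on the semantic branch.
set_trainable(model, True)
set_trainable(model.depth_branch, False)
stage2 = Adam([{"params": model.backbone.parameters(), "lr": 1e-5},
               {"params": model.semantic_branch.parameters(), "lr": 1e-3}],
              betas=(0.9, 0.999), eps=1e-8)

# Stage 3: end-to-end, lr 1e-4 for the backbone and 1e-2 elsewhere, decayed 10x every 20 epochs.
set_trainable(model, True)
stage3 = Adam([{"params": model.backbone.parameters(), "lr": 1e-4},
               {"params": [p for n, p in model.named_parameters()
                           if not n.startswith("backbone")], "lr": 1e-2}],
              betas=(0.9, 0.999), eps=1e-8)
scheduler = StepLR(stage3, step_size=20, gamma=0.1)
```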

4.2 Experimental Results

Architecture Analysis. We first compare different settings of the network architecture: the depth-only branch, i.e. ResNet with up-convs; adding the SUC; adding our depth-aware loss (\(L_{DA}\)); and adding the semantic branch with and without LSUs. To better illustrate the effectiveness of the proposed knowledge sharing strategy, we also include the CS structure [38] (substituting it for the LSU) for comparison. Our final method with the attention-driven loss is compared against these baselines. In this analysis the semantic labels are mapped to 4 categories. The comparison results are shown in Table 1, where we can see that the performance is continuously improved by incorporating each component. In particular, after introducing the proposed depth-aware loss, performance on all metrics improves by a large margin. We note that the CS structure does benefit representation sharing, while our LSU performs slightly better. The synergy boost from the semantic labeling task also contributes considerably to depth estimation. In summary, the attention-driven loss contributes the most to the performance, followed by the knowledge sharing from semantic labeling.

Table 1. Architecture analysis. Results are shown on NYUD2 dataset with 4-category mapped as semantic labeling task
Table 2. Analysis on robustness to data “tail”. Study performed on NYUD2 with 4-category mapped semantic labels

Robustness to “Tail”. To validate the robustness of the proposed approach to long-tailed data, we perform an ablation study on the tailed part of the data. Specifically, we divide the depth range of the test data into four parts by cutting the tail in 2 m increments (i.e., \({\le }4\) m, \({\le }6\) m, \({\le }8\) m, \({\le }10\) m) and evaluate our method on these depth ranges, as shown in Table 2. From the table we can see that even though our attention-driven loss supervises the network to focus more on distant depth, it performs well on shorter-tailed data and consistently across different ranges, which indicates that the proposed attention loss adaptively adjusts to the data distribution. In addition, our method achieves state-of-the-art results even on nearby depth.

Table 3. Comparison with state-of-the-art methods on NYUD2 dataset. The last two rows show the proposed approach with 4 and 40 semantic categories, respectively

Comparison with State-of-the-Art. We also compare the proposed approach with other state-of-the-art methods, using the results reported in their original papers. The comparison on NYUD2 is shown in Table 3. For our approach, we consider two sharing settings with the semantic labeling task: sharing information from the 4 mapped categories and from the 40 mapped categories, as shown in the last two rows. From the results in Table 3, our approach performs favorably against other state-of-the-art methods. Note that [19, 39, 51] also utilize semantic labeling information in a joint prediction manner, yet they do not perform as well as ours. We also include a state-of-the-art method [28] augmented with a semantic labeling branch for a better understanding of the semantic boost. The improvement over [28] validates the effectiveness of adding the semantic task, while its information sharing remains underexplored. Another observation is that using more categories benefits depth prediction, since it provides more semantic information about the objects in the scene.

Fig. 6. Qualitative results on the NYUD2 dataset. Our method predicts more accurate depth than other state-of-the-art methods, especially in distant regions. Depth maps are shown in the same range as the ground truth. Warm colors indicate large depth.

Table 4. Evaluation of semantic labeling on the NYUD2-40

In addition to the quantitative comparison, some qualitative results are presented in Fig. 6. All depth maps are shown in the same range as the ground truth for better comparison. As can be seen in the figure, the proposed method predicts more accurate depth values than the other methods, for instance in the large-depth (red) regions of these examples and the wall region in the last example. Furthermore, the semantic prior also benefits depth prediction: e.g. the floor mat in the last example should have a depth similar to the floor instead of floating above it. This again validates the effectiveness of the proposed approach, which focuses more on hard distant depth and object semantics.

Semantic Labeling. Although the semantic labeling task is incorporated to share knowledge and boost the depth prediction task, the proposed network infers a semantic segmentation map as well. Here we evaluate whether the depth prediction task benefits semantic labeling, using three metrics in percentage (%): pixel accuracy, mean accuracy, and Intersection over Union (IoU). We set the model without the depth branch and without \(L_{semF}\) as a baseline, and also compare with the model with \(L_{semF}\) (but without depth). Other semantic segmentation methods are included for comparison (with their reported performance). The results on the NYUD2 dataset with the mapped 40 categories are shown in Table 4. As the table shows, our inferred semantic result achieves state-of-the-art performance as well. We note that, even without the depth information, our model still performs favorably against [4] and [37], which take RGB-D as input. This validates the effectiveness of the proposed SUC and \(L_{semF}\) to some extent. We also compare with [19, 51], which map the raw data to 5 categories, different from the standard 4 categories. After fine-tuning our 4-category model on their data, we achieve (87.11, 66.77) on (pix. acc., IoU), compared to (70.29, 44.20) for [51] and (73.04, 54.27) for [19].

Fig. 7. Results on SUN RGB-D. Some regions (white boxes) are difficult even for ground truth capture.

Generalization Analysis. In addition to the NYUD2 dataset, we further explore the generalization ability of our model to other indoor and outdoor scenes. Performance on another indoor dataset, SUN RGB-D [48], is shown in Fig. 7, where our results are predicted by the original model without fine-tuning on SUN. The results show that even though SUN differs from NYUD2 in data distribution, our model still predicts plausible results. For outdoor scenes, we fine-tune the indoor model on 200 standard training images (with sparse depth and semantic labels) from the KITTI dataset [7]. The performance is (RMSE, RMSElog, \(\delta <1.25\), \(\delta <1.25^{2}\), \(\delta <1.25^{3}\)) = (5.110, 0.215, 0.843, 0.950, 0.981), following the evaluation setup in [9, 26]. We also evaluate on the Cityscapes dataset [2], following the setup in [23]. The (Mean Error, RMSE) on the converted disparity is (2.11, 4.92), compared to (2.92, 5.88) for [23]. These evaluations show that, despite differences in distribution and scene structure, our model generalizes well to other datasets.

5 Conclusions

We have introduced an attention-driven learning approach for monocular depth estimation that also predicts accurate semantic labels. To predict accurate depth for the whole scene, we delve into its deeper parts and propose a novel attention-driven loss to supervise the training. We have also presented a sharing strategy with LSUs and SUCs to better propagate both inter- and intra-task knowledge. Experimental results on the NYUD2 dataset show that the proposed method performs favorably against state-of-the-art approaches, especially on hard distant regions. We have also shown that our model generalizes to other datasets and scenes.