
Cross-Modal Pyramid Translation for RGB-D Scene Recognition

Published in: International Journal of Computer Vision

Abstract

Existing RGB-D scene recognition approaches typically employ two separate, modality-specific networks to learn RGB and depth representations independently. This independent training scheme fails to capture the correlation between the two modalities and may therefore be suboptimal for RGB-D scene recognition. To address this issue, this paper proposes a general and flexible framework, coined TRecgNet, that enhances RGB-D representation learning with a customized cross-modal pyramid translation branch. The framework unifies cross-modal translation and modality-specific recognition in a shared feature encoder, and leverages the correspondence between the two modalities to regularize the representation learning of each. Specifically, we present a cross-modal pyramid translation strategy that performs multi-scale image generation under a carefully designed layer-wise perceptual supervision. To make cross-modal translation more complementary to modality-specific scene recognition, we devise a feature selection module that adaptively enhances discriminative information during translation. In addition, we train multiple auxiliary classifiers to further regularize the generated data so that its label predictions remain consistent with those of its paired real data. Meanwhile, the translation branch allows us to generate cross-modal data for training data augmentation and thus further improve single-modality scene recognition. Extensive experiments on the SUN RGB-D and NYU Depth V2 benchmarks demonstrate the superiority of the proposed method over state-of-the-art RGB-D scene recognition methods. We also generalize TRecgNet to the single-modality scene recognition benchmark MIT Indoor, automatically synthesizing a depth view to boost the final recognition accuracy.
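
To make the translate-to-recognize idea above concrete, the following is a minimal PyTorch-style sketch: a shared encoder feeds both a scene classifier and a pyramid decoder that regresses the other modality (here, depth from RGB) at several scales, with an auxiliary classifier applied to the translated output. All names (TRecgNetSketch, training_loss), the tiny three-stage encoder, and the plain multi-scale L1 loss standing in for the paper's layer-wise perceptual supervision and feature selection module are illustrative assumptions, not the authors' released implementation.

    # Minimal sketch of a translate-and-recognize network (illustrative only).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv_block(in_ch, out_ch, stride=1):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    class TRecgNetSketch(nn.Module):
        """Shared encoder + recognition head + pyramid translation decoder (toy version)."""

        def __init__(self, num_classes, in_ch=3, out_ch=1):
            super().__init__()
            # Shared encoder with three stages (the paper uses a deeper ResNet backbone).
            self.enc1 = conv_block(in_ch, 32, stride=2)   # 1/2 resolution
            self.enc2 = conv_block(32, 64, stride=2)      # 1/4 resolution
            self.enc3 = conv_block(64, 128, stride=2)     # 1/8 resolution
            # Recognition head on the deepest features.
            self.classifier = nn.Linear(128, num_classes)
            # Pyramid decoder: predict the other modality at every encoder scale.
            self.dec1 = nn.Conv2d(32, out_ch, 3, padding=1)
            self.dec2 = nn.Conv2d(64, out_ch, 3, padding=1)
            self.dec3 = nn.Conv2d(128, out_ch, 3, padding=1)
            # Auxiliary classifier applied to the finest translated map.
            self.aux = nn.Sequential(conv_block(out_ch, 32, stride=4),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(32, num_classes))

        def forward(self, x):
            f1 = self.enc1(x)
            f2 = self.enc2(f1)
            f3 = self.enc3(f2)
            logits = self.classifier(F.adaptive_avg_pool2d(f3, 1).flatten(1))
            # Translations at 1/2, 1/4 and 1/8 of the input resolution.
            pyramid = [self.dec1(f1), self.dec2(f2), self.dec3(f3)]
            return logits, pyramid

    def training_loss(model, rgb, depth, labels, lambda_trans=1.0, lambda_aux=0.5):
        """Joint objective: scene classification + multi-scale translation + auxiliary label consistency."""
        logits, pyramid = model(rgb)
        loss_cls = F.cross_entropy(logits, labels)
        # Multi-scale reconstruction term (a simple stand-in for layer-wise perceptual supervision).
        loss_trans = sum(F.l1_loss(p, F.interpolate(depth, size=p.shape[-2:],
                                                    mode='bilinear', align_corners=False))
                         for p in pyramid)
        # The auxiliary classifier keeps the generated depth predictive of the scene label.
        loss_aux = F.cross_entropy(model.aux(pyramid[0]), labels)
        return loss_cls + lambda_trans * loss_trans + lambda_aux * loss_aux

    if __name__ == "__main__":
        model = TRecgNetSketch(num_classes=19)   # SUN RGB-D is commonly evaluated on 19 scene classes
        rgb = torch.randn(2, 3, 224, 224)
        depth = torch.randn(2, 1, 224, 224)
        labels = torch.tensor([3, 7])
        loss = training_loss(model, rgb, depth, labels)
        loss.backward()
        print(float(loss))

In the same spirit, depth maps produced by such a translation branch can be recycled as extra training samples, which is the data augmentation use of the translation branch mentioned in the abstract.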




Acknowledgements

This work is supported by the National Science Foundation of China (No. 62076119, No. 61921006), Program for Innovative Talents and Entrepreneur in Jiangsu Province, and Collaborative Innovation Center of Novel Software Technology and Industrialization.

Author information


Corresponding author

Correspondence to Limin Wang.

Additional information

Communicated by Frederic Jurie.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Du, D., Wang, L., Li, Z. et al. Cross-Modal Pyramid Translation for RGB-D Scene Recognition. Int J Comput Vis 129, 2309–2327 (2021). https://doi.org/10.1007/s11263-021-01475-7

