Abstract
Existing RGB-D scene recognition approaches typically employ two separate, modality-specific networks to learn effective RGB and depth representations. This independent training scheme fails to capture the correlation between the two modalities and may therefore be suboptimal for RGB-D scene recognition. To address this issue, this paper proposes a general and flexible framework, coined TRecgNet, that enhances RGB-D representation learning with a customized cross-modal pyramid translation branch. The framework unifies the tasks of cross-modal translation and modality-specific recognition with a shared feature encoder, and leverages the correspondence between the two modalities to regularize the representation learning of each one. Specifically, we present a cross-modal pyramid translation strategy that performs multi-scale image generation under a carefully designed layer-wise perceptual supervision. To make cross-modal translation more complementary to modality-specific scene recognition, we devise a feature selection module that adaptively enhances discriminative information during the translation procedure. In addition, we train multiple auxiliary classifiers to further regularize the generated data to be consistent with its paired real data on label prediction. Meanwhile, the translation branch enables us to generate cross-modal data for training-time data augmentation, further improving single-modality scene recognition. Extensive experiments on the SUN RGB-D and NYU Depth V2 benchmarks demonstrate the superiority of the proposed method over state-of-the-art RGB-D scene recognition methods. We also generalize TRecgNet to the single-modality scene recognition benchmark MIT Indoor, automatically synthesizing a depth view to boost the final recognition accuracy.
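The unified objective described above — modality-specific classification regularized by multi-scale translation with layer-wise perceptual supervision — can be sketched as a weighted sum of a classification term and per-scale perceptual terms. The following is a minimal illustrative sketch, not the paper's implementation; the function names, the mean-squared perceptual distance, and the weight `lam` are assumptions chosen for clarity.

```python
import numpy as np

def cross_entropy(logits, label):
    # Softmax cross-entropy for a single example.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def perceptual_loss(feats_gen, feats_real):
    # Mean squared distance between generated and real feature
    # maps at one pyramid scale (illustrative choice of distance).
    return float(np.mean((feats_gen - feats_real) ** 2))

def trecg_objective(logits, label, pyramid_gen, pyramid_real, lam=10.0):
    """Illustrative TRecgNet-style objective: classification loss plus
    layer-wise perceptual supervision summed over pyramid scales."""
    cls = cross_entropy(logits, label)
    trans = sum(perceptual_loss(g, r)
                for g, r in zip(pyramid_gen, pyramid_real))
    return cls + lam * trans
```

When the translated pyramid matches the paired real pyramid exactly, the translation term vanishes and the objective reduces to the classification loss; any translation error adds a penalty at every scale, which is how the translation branch regularizes the shared encoder.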
Acknowledgements
This work is supported by the National Science Foundation of China (No. 62076119, No. 61921006), Program for Innovative Talents and Entrepreneur in Jiangsu Province, and Collaborative Innovation Center of Novel Software Technology and Industrialization.
Additional information
Communicated by Frederic Jurie.
Cite this article
Du, D., Wang, L., Li, Z. et al. Cross-Modal Pyramid Translation for RGB-D Scene Recognition. Int J Comput Vis 129, 2309–2327 (2021). https://doi.org/10.1007/s11263-021-01475-7