
MSANet: multimodal self-augmentation and adversarial network for RGB-D object recognition

  • Original Article
The Visual Computer

Abstract

This paper addresses the problem of object recognition from RGB-D data. Although deep convolutional neural networks have made considerable progress in this area, they still suffer from the lack of large-scale, manually labeled RGB-D data: labeling a large-scale RGB-D dataset is a time-consuming and tedious task. More importantly, such large-scale datasets often exhibit a long tail, and the hard positive examples in the tail can hardly be recognized. To address these problems, we propose a multimodal self-augmentation and adversarial network (MSANet) for RGB-D object recognition, which augments the data effectively at two levels while preserving the annotations. At the first level, a series of transformations is leveraged to generate class-agnostic examples for each instance, which supports the training of our MSANet. At the second level, an adversarial network generates class-specific hard positive examples while learning to classify them correctly, further improving the performance of our MSANet. With these schemes, the proposed approach achieves the best results on several publicly available RGB-D object recognition datasets; for example, our experiments show a 1.5% accuracy gain on the benchmark Washington RGB-D object dataset over the current state of the art.
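The first-level, annotation-preserving augmentation described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name `augment_rgbd`, the particular transforms (rotation, flip, brightness jitter), and their parameter ranges are all assumptions. The essential constraint it illustrates is that the same geometric transform is applied to both modalities, so the class label of the pair is unchanged.

```python
# Hypothetical sketch of class-agnostic augmentation for paired RGB-D data.
# The SAME geometric transform is applied to both modalities so the
# instance's class label is preserved.
import numpy as np

def augment_rgbd(rgb, depth, rng):
    """Apply one randomly chosen label-preserving transform to an RGB-D pair."""
    k = rng.integers(0, 4)                 # random multiple of 90-degree rotation
    rgb = np.rot90(rgb, k, axes=(0, 1))
    depth = np.rot90(depth, k, axes=(0, 1))
    if rng.random() < 0.5:                 # random horizontal flip, applied jointly
        rgb = rgb[:, ::-1]
        depth = depth[:, ::-1]
    # Photometric jitter touches RGB only; geometry (and thus depth) is unchanged.
    rgb = np.clip(rgb * rng.uniform(0.8, 1.2), 0, 255)
    return rgb, depth

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(64, 64, 3)).astype(float)
depth = rng.random((64, 64))
aug_rgb, aug_depth = augment_rgbd(rgb, depth, rng)
```

Because every transform here is label-preserving, each instance yields many training examples at no annotation cost; the second-level adversarial generation of hard positives then targets the long tail that such generic transforms cannot cover.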




Acknowledgements

This work is supported in part by MS-RA CCRP Funding FY16-RES-THEME-039. The authors thank the anonymous reviewers for their helpful comments, which improved this paper.

Author information

Corresponding author

Correspondence to Xukun Shen.

About this article

Cite this article

Zhou, F., Hu, Y. & Shen, X. MSANet: multimodal self-augmentation and adversarial network for RGB-D object recognition. Vis Comput 35, 1583–1594 (2019). https://doi.org/10.1007/s00371-018-1559-x
