Visual-Haptic-Kinesthetic Object Recognition with Multimodal Transformer

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2023 (ICANN 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14260))

Abstract

Humans recognize objects by combining multi-sensory information in a coordinated fashion. In robotics, however, vision-based and haptic-based object recognition remain two separate research directions. Visual images and haptic time series have different properties, which makes it difficult for robots to fuse them for object recognition as humans do. In this work, we propose an architecture that fuses visual, haptic and kinesthetic data for object recognition, based on multimodal Convolutional Recurrent Neural Networks with a Transformer. We use Convolutional Neural Networks (CNNs) to learn spatial representations, Recurrent Neural Networks (RNNs) to model temporal relationships, and the Transformer’s self-attention and cross-attention structures to focus on global and cross-modal information. We propose two fusion methods and conduct experiments on the multimodal AU dataset. The results show that our model achieves higher accuracy than the latest multimodal object recognition methods. We also conduct an ablation study on the individual input components to demonstrate the importance of multimodal information in object recognition. The code will be available at https://github.com/SYLan2019/VHKOR.
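
The difference between the self-attention and cross-attention mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it omits the learned projections, multiple heads, and CNN/RNN feature extractors a real model would use, and the names `attention`, `visual`, and `haptic` as well as the toy feature values are hypothetical. It only shows that self-attention draws queries, keys, and values from one modality, while cross-attention takes queries from one modality and keys and values from another.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections.

    queries: n vectors of length d; keys: m vectors of length d;
    values: m vectors. Returns (n output vectors, n x m weight matrix).
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Hypothetical toy features (assumed for illustration only).
visual = [[1.0, 0.0], [0.0, 1.0]]              # 2 visual tokens
haptic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 haptic time steps

# Self-attention: haptic steps attend to other haptic steps.
self_out, self_w = attention(haptic, haptic, haptic)

# Cross-attention: visual tokens query the haptic sequence.
cross_out, cross_w = attention(visual, haptic, haptic)
```

In this sketch the cross-attention output has one vector per visual token, each a weighted mixture of haptic features, which is one simple way cross-modal information can be injected into a single modality's representation.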

This work was funded by 2035 Innovation Pilot Program of Sichuan University, China.

Author information

Corresponding author

Correspondence to Shiyong Lan.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhou, X., Lan, S., Wang, W., Li, X., Zhou, S., Yang, H. (2023). Visual-Haptic-Kinesthetic Object Recognition with Multimodal Transformer. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14260. Springer, Cham. https://doi.org/10.1007/978-3-031-44195-0_20

  • DOI: https://doi.org/10.1007/978-3-031-44195-0_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44194-3

  • Online ISBN: 978-3-031-44195-0

  • eBook Packages: Computer Science, Computer Science (R0)
