Visual-Haptic-Kinesthetic Object Recognition with Multimodal Transformer

Zhou, Xinyuan; Lan, Shiyong; Wang, Wenwu; Li, Xinyang; Zhou, Siyuan; Yang, Hongyu

doi:10.1007/978-3-031-44195-0_20

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14260)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Humans recognize objects by combining multi-sensory information in a coordinated fashion. However, vision-based and haptic-based object recognition remain two separate research directions in robotics. Visual images and haptic time series have different properties, which makes it difficult for robots to fuse them for object recognition as humans do. In this work, we propose an architecture that fuses visual, haptic and kinesthetic data for object recognition, based on multimodal Convolutional Recurrent Neural Networks with a Transformer. We use Convolutional Neural Networks (CNNs) to learn spatial representations, Recurrent Neural Networks (RNNs) to model temporal relationships, and the Transformer's self-attention and cross-attention structures to focus on global and cross-modal information. We propose two fusion methods and conduct experiments on the multimodal AU dataset. The results show that our model offers higher accuracy than the latest multimodal object recognition methods. We conduct an ablation study on the individual components of the inputs to demonstrate the importance of multimodal information in object recognition. The code will be available at https://github.com/SYLan2019/VHKOR.
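
To illustrate the difference between the self-attention and cross-attention mentioned in the abstract, the following is a minimal sketch of scaled dot-product attention in plain Python. It is not the authors' implementation: the toy feature values, the token counts, and the names `attention`, `visual` and `haptic` are assumptions made for illustration, and the learned projections, multiple heads, and CNN/RNN encoders of a real model are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values), weights


# Hypothetical encoder outputs: 2 visual tokens and 3 haptic time steps,
# each with feature dimension 4 (values chosen arbitrarily).
visual = [[1.0, 0.0, 0.5, 0.2],
          [0.3, 0.8, 0.1, 0.0]]
haptic = [[0.2, 0.1, 0.9, 0.4],
          [0.7, 0.3, 0.0, 0.5],
          [0.1, 0.6, 0.2, 0.8]]

# Self-attention: queries, keys and values all come from one modality.
self_out, self_w = attention(haptic, haptic, haptic)

# Cross-attention: visual tokens query the haptic sequence.
cross_out, cross_w = attention(visual, haptic, haptic)
```

In this sketch the cross-attention output has one row per visual token, each a weighted mix of haptic features, which is one simple way to read "cross-modal information"; how the paper's two fusion methods combine such outputs is not specified in the abstract.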

This work was funded by 2035 Innovation Pilot Program of Sichuan University, China.

Author information

Authors and Affiliations

College of Computer Science, Sichuan University, Chengdu, 610065, China
Xinyuan Zhou, Shiyong Lan, Xinyang Li, Siyuan Zhou & Hongyu Yang
University of Surrey, Guildford, GU2 7XH, UK
Wenwu Wang

Corresponding author

Correspondence to Shiyong Lan.

Editor information

Editors and Affiliations

Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas
Lancaster University, Lancaster, UK
Plamen Angelov
Teesside University, Middlesbrough, UK
Chrisina Jayne

About this paper

Cite this paper

Zhou, X., Lan, S., Wang, W., Li, X., Zhou, S., Yang, H. (2023). Visual-Haptic-Kinesthetic Object Recognition with Multimodal Transformer. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14260. Springer, Cham. https://doi.org/10.1007/978-3-031-44195-0_20

DOI: https://doi.org/10.1007/978-3-031-44195-0_20
Published: 22 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44194-3
Online ISBN: 978-3-031-44195-0
eBook Packages: Computer Science (R0)
