Abstract
Hand Gesture Recognition (HGR) has advanced significantly through multimodal approaches that combine RGB and Optical Flow (OF). Yet two main challenges remain: (i) the computational burden of advanced techniques that rely on intricate multi-level fusion blocks distributed across the architecture, and (ii) the limited exploration of how OF estimators affect multimodal fusion. To address these, this paper introduces an efficient RGB+OF fusion that relies on just a few 3DConv layers applied early in the architecture. Concurrently, we examine the impact of five state-of-the-art OF methods on this fusion. Moving beyond traditional HGR, we aim not only to recognize but also to precisely localize the hand gesture, which is critical for a wide range of computer vision applications, thereby shifting the focus to Hand Gesture Recognition and Localization (HGRL). Accordingly, we employ a YOLO-based architecture, renowned for its real-time efficiency and precision in object localization, which matches the demands of the dynamic gestures typical of HGRL. We evaluate our approach on the IPN-Hand dataset, extending its scope for HGRL evaluation by manually annotating 82,769 frames. Our experiments show a significant 10% mAP improvement over the RGB-only method and a 7% gain over 2DConv-based fusion.
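To make the early-fusion idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: the module name (Early3DConvFusion), the two-layer depth, the channel widths, and the temporal pooling are all illustrative assumptions, and the YOLO detection backbone that would consume the fused features is omitted. SiLU is used only because it is YOLOv5's default activation.

```python
import torch
import torch.nn as nn


class Early3DConvFusion(nn.Module):
    """Fuse an RGB clip (3 channels) with optical flow (2 channels) using a
    small stack of 3D convolutions applied early, then collapse time so a
    standard 2D YOLO-style detector can consume the result."""

    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.fuse = nn.Sequential(
            # Per frame: 3 RGB + 2 flow channels concatenated -> 5 inputs.
            nn.Conv3d(5, 32, kernel_size=3, padding=1),
            nn.SiLU(inplace=True),
            nn.Conv3d(32, out_channels, kernel_size=3, padding=1),
            nn.SiLU(inplace=True),
            # Collapse the temporal dimension (T -> 1), keep H and W.
            nn.AdaptiveAvgPool3d((1, None, None)),
        )

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, T, H, W), flow: (B, 2, T, H, W)
        x = torch.cat([rgb, flow], dim=1)  # (B, 5, T, H, W)
        x = self.fuse(x)                   # (B, C, 1, H, W)
        return x.squeeze(2)                # (B, C, H, W), ready for a 2D backbone


# Quick shape check on a dummy 8-frame clip.
rgb = torch.randn(1, 3, 8, 320, 320)
flow = torch.randn(1, 2, 8, 320, 320)
print(Early3DConvFusion()(rgb, flow).shape)  # torch.Size([1, 64, 320, 320])
```

Concatenating the 2-channel flow with the 3 RGB channels before any convolution keeps the multimodal interaction confined to these first few 3D layers, which is what lets the rest of the network remain a conventional, efficient 2D detector.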
Supported by a Research Grant (S) from the Tateisi Science and Technology Foundation.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Benitez-Garcia, G., Takahashi, H. (2024). Efficient 3Dconv Fusion of RGB and Optical Flow for Dynamic Hand Gesture Recognition and Localization. In: Yan, W.Q., Nguyen, M., Nand, P., Li, X. (eds) Image and Video Technology. PSIVT 2023. Lecture Notes in Computer Science, vol 14403. Springer, Singapore. https://doi.org/10.1007/978-981-97-0376-0_15
DOI: https://doi.org/10.1007/978-981-97-0376-0_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0375-3
Online ISBN: 978-981-97-0376-0
eBook Packages: Computer Science, Computer Science (R0)