
Efficient 3Dconv Fusion of RGB and Optical Flow for Dynamic Hand Gesture Recognition and Localization

  • Conference paper
Image and Video Technology (PSIVT 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14403)


Abstract

Hand Gesture Recognition (HGR) has advanced significantly through multimodal approaches that combine RGB and Optical Flow (OF). Yet two main challenges remain: (i) the computational burden of advanced techniques that rely on intricate multi-level fusion blocks distributed across the architecture, and (ii) the limited exploration of how OF estimators affect multimodal fusion. To address these, this paper introduces an efficient RGB+OF fusion that relies on just a few 3DConv layers applied early in the architecture. Concurrently, we examine the impact of five state-of-the-art OF methods on this fusion. Moving beyond traditional HGR, we prioritize both recognizing and precisely localizing the hand gesture, which is critical for a wide range of computer vision applications, thus shifting the focus to Hand Gesture Recognition and Localization (HGRL). Accordingly, we employ a YOLO-based architecture, renowned for its real-time efficiency and precision in object localization, which aligns with the demands of the dynamic gestures typical of HGRL. We evaluate our approach on the IPN-Hand dataset, extending it for HGRL evaluation by manually annotating 82,769 frames. Our experiments show significant gains: a 10% mAP improvement over the RGB-only baseline and a 7% gain over 2DConv-based fusion.
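The paper itself publishes no code, but the input side of the fusion the abstract describes, RGB and optical-flow frames stacked channel-wise before a few early 3D convolutions, can be sketched as follows. The function name, clip length, and frame size are illustrative assumptions, not the authors' actual configuration.

```python
import numpy as np

def stack_rgb_flow(rgb_clip: np.ndarray, flow_clip: np.ndarray) -> np.ndarray:
    """Build the 5-channel clip an early RGB+OF 3DConv fusion would consume.

    rgb_clip:  (T, H, W, 3) RGB frames
    flow_clip: (T, H, W, 2) optical-flow fields (u, v per pixel)
    Returns a (5, T, H, W) array, channels-first with a temporal axis,
    the usual input layout for a 3D convolution.
    """
    # Frames and spatial dimensions must match across modalities.
    assert rgb_clip.shape[:3] == flow_clip.shape[:3]
    # Concatenate along the channel axis: 3 (RGB) + 2 (flow) = 5 channels.
    clip = np.concatenate([rgb_clip, flow_clip], axis=-1)  # (T, H, W, 5)
    # Reorder to (channels, time, height, width).
    return clip.transpose(3, 0, 1, 2)

# Hypothetical 8-frame clip at 64x64 resolution.
x = stack_rgb_flow(np.zeros((8, 64, 64, 3)), np.zeros((8, 64, 64, 2)))
print(x.shape)  # (5, 8, 64, 64)
```

Fusing at this single, early point is what keeps the approach cheap relative to multi-level fusion blocks spread across the network: after the few 3DConv layers, the rest of the YOLO-based detector processes one fused feature stream.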

Supported by a Research Grant (S) from the Tateisi Science and Technology Foundation.




Author information

Corresponding author: Gibran Benitez-Garcia.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Benitez-Garcia, G., Takahashi, H. (2024). Efficient 3Dconv Fusion of RGB and Optical Flow for Dynamic Hand Gesture Recognition and Localization. In: Yan, W.Q., Nguyen, M., Nand, P., Li, X. (eds) Image and Video Technology. PSIVT 2023. Lecture Notes in Computer Science, vol 14403. Springer, Singapore. https://doi.org/10.1007/978-981-97-0376-0_15


  • DOI: https://doi.org/10.1007/978-981-97-0376-0_15

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0375-3

  • Online ISBN: 978-981-97-0376-0

  • eBook Packages: Computer Science; Computer Science (R0)
