Abstract
Hand Gesture Recognition (HGR) has advanced significantly through multimodal approaches that combine RGB and Optical Flow (OF). Yet two main challenges remain: (i) the computational burden of advanced techniques that rely on intricate multi-level fusion blocks distributed across the architecture, and (ii) the limited exploration of how OF estimators affect multimodal fusion. To address these, this paper introduces an efficient RGB+OF fusion that relies on just a few 3DConv layers applied early in the architecture. Concurrently, we examine the impact of five state-of-the-art OF methods on this fusion. Moving beyond traditional HGR, we aim not only to recognize but also to precisely localize the hand gesture, which is critical for a wide range of computer vision applications, thereby shifting the focus to Hand Gesture Recognition and Localization (HGRL). Accordingly, we employ a YOLO-based architecture, renowned for its real-time efficiency and precision in object localization, which matches the demands of the dynamic gestures typical of HGRL. We evaluate our approach on the IPN-Hand dataset, extending its scope for HGRL evaluation by manually annotating 82,769 frames. Our experiments show a significant 10% mAP improvement over the RGB-only method and a 7% gain over 2DConv-based fusion.
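To make the early-fusion idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: the module name (Early3DConvFusion), the two-layer depth, the channel widths, and the temporal pooling are all illustrative assumptions, and the YOLO detection backbone that would consume the fused features is omitted. SiLU is used only because it is YOLOv5's default activation.

```python
import torch
import torch.nn as nn


class Early3DConvFusion(nn.Module):
    """Fuse an RGB clip (3 channels) with optical flow (2 channels) using a
    small stack of 3D convolutions applied early, then collapse time so a
    standard 2D YOLO-style detector can consume the result."""

    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.fuse = nn.Sequential(
            # Per frame: 3 RGB + 2 flow channels concatenated -> 5 inputs.
            nn.Conv3d(5, 32, kernel_size=3, padding=1),
            nn.SiLU(inplace=True),
            nn.Conv3d(32, out_channels, kernel_size=3, padding=1),
            nn.SiLU(inplace=True),
            # Collapse the temporal dimension (T -> 1), keep H and W.
            nn.AdaptiveAvgPool3d((1, None, None)),
        )

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, T, H, W), flow: (B, 2, T, H, W)
        x = torch.cat([rgb, flow], dim=1)  # (B, 5, T, H, W)
        x = self.fuse(x)                   # (B, C, 1, H, W)
        return x.squeeze(2)                # (B, C, H, W), ready for a 2D backbone


# Quick shape check on a dummy 8-frame clip.
rgb = torch.randn(1, 3, 8, 320, 320)
flow = torch.randn(1, 2, 8, 320, 320)
print(Early3DConvFusion()(rgb, flow).shape)  # torch.Size([1, 64, 320, 320])
```

Concatenating the 2-channel flow with the 3 RGB channels before any convolution keeps the multimodal interaction confined to these first few 3D layers, which is what lets the rest of the network remain a conventional, efficient 2D detector.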
Supported by a Research Grant (S) from the Tateisi Science and Technology Foundation.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Benitez-Garcia, G., Takahashi, H. (2024). Efficient 3Dconv Fusion of RGB and Optical Flow for Dynamic Hand Gesture Recognition and Localization. In: Yan, W.Q., Nguyen, M., Nand, P., Li, X. (eds) Image and Video Technology. PSIVT 2023. Lecture Notes in Computer Science, vol 14403. Springer, Singapore. https://doi.org/10.1007/978-981-97-0376-0_15
DOI: https://doi.org/10.1007/978-981-97-0376-0_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0375-3
Online ISBN: 978-981-97-0376-0
eBook Packages: Computer Science, Computer Science (R0)