Effective and efficient approach for gesture detection in video through monocular RGB frames

Multimedia Tools and Applications

Abstract

Detecting gestures is a difficult task, especially when the context is dynamic or noisy. Several approaches rely on a bounding box, which restricts both the usable area of the frame and the user’s freedom of movement. This paper proposes a novel method for gesture detection in real-time video that aims not only to simplify the process but also to extract useful and diverse information from the given gestures. The proposed approach uses a Residual Neural Network (ResNet-101) to achieve foreground-background separation (FBGS), while the MiDaS model is used for depth estimation (DE) from monocular RGB frames of gestures, increasing precision and obviating the need for a bounding box entirely. For comparative analysis, this hierarchical model is evaluated with and without a Graphics Processing Unit (GPU). In real-time operation, the GPU reduces processing time by 90% while simultaneously improving the accuracy of the final result. In the output frame, the noisy backdrop is removed, the gestures are enhanced, and the relative distance between objects and gestures is highlighted. The proposed algorithm also avoids duplicating gestures.
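The pipeline described above can be approximated with publicly available pretrained models. The sketch below is a minimal illustration under stated assumptions, not the authors’ implementation: torchvision’s DeepLabV3 with a ResNet-101 backbone stands in for the paper’s FBGS network, the small MiDaS variant from torch.hub stands in for its depth estimator, and the VOC “person” class serves as a proxy for the gesturing foreground.

```python
# Minimal sketch (stand-in models, not the authors' released code):
# FBGS via a ResNet-101-backbone segmentation network and monocular
# depth estimation via MiDaS, run on a single RGB frame.
import cv2
import numpy as np
import torch
import torchvision
from torchvision.transforms import functional as TF

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in FBGS network: DeepLabV3 with the ResNet-101 backbone the paper
# names (the paper's exact segmentation head is not given in the abstract).
seg_model = torchvision.models.segmentation.deeplabv3_resnet101(
    weights="DEFAULT").eval().to(device)

# MiDaS depth model and its matching preprocessing transform via torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").eval().to(device)
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
midas_transform = midas_transforms.small_transform

def process_frame(frame_bgr: np.ndarray):
    """Return (foreground mask, depth map) for one BGR video frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

    # --- Foreground-background separation (FBGS) ---
    x = TF.normalize(TF.to_tensor(rgb),
                     mean=[0.485, 0.456, 0.406],
                     std=[0.229, 0.224, 0.225])
    with torch.no_grad():
        logits = seg_model(x.unsqueeze(0).to(device))["out"][0]
    # Class 15 is "person" in the VOC label set this model was trained with;
    # here it acts as a proxy for the gesturing subject.
    mask = (logits.argmax(0) == 15).cpu().numpy().astype(np.uint8)

    # --- Monocular depth estimation (DE) ---
    with torch.no_grad():
        pred = midas(midas_transform(rgb).to(device))
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=rgb.shape[:2],
            mode="bicubic", align_corners=False).squeeze()
    depth = pred.cpu().numpy()  # relative inverse depth: larger = closer

    return mask, depth

cap = cv2.VideoCapture(0)  # any real-time video source works here
ok, frame = cap.read()
if ok:
    mask, depth = process_frame(frame)
    foreground = frame * mask[..., None]  # suppress the noisy backdrop
cap.release()
```

Moving both models to a CUDA device (the `device` line) versus leaving them on the CPU is also the natural way to reproduce the with/without-GPU timing comparison described in the abstract.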


Data availability

The data and papers referred to in this work are available from the corresponding author on request.


Author information

Corresponding author

Correspondence to Rameez Shamalik.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shamalik, R., Koli, S. Effective and efficient approach for gesture detection in video through monocular RGB frames. Multimed Tools Appl 82, 17231–17242 (2023). https://doi.org/10.1007/s11042-022-14207-x
