Abstract
Detecting gestures is a difficult task, especially in dynamic or noisy contexts. Many approaches rely on a bounding box, which restricts both the usable area of the frame and the user's freedom of movement. This paper proposes a novel method for gesture detection in real-time video that aims not only to simplify the process but also to extract useful and diverse information from the given gestures. The proposed approach uses a Residual Neural Network (ResNet-101) to achieve foreground-background separation (FBGS), while the MiDaS model performs depth estimation (DE) on monocular RGB frames of gestures, increasing precision and removing the need for a bounding box entirely. For comparative analysis, this hierarchical model is evaluated with and without a Graphics Processing Unit (GPU). In this real-time model, the GPU reduces processing time by 90% while simultaneously improving the accuracy of the final result. In the final frame, the noisy backdrop is removed, the gestures are enhanced, and the relative distance between the objects and gestures is highlighted. The proposed algorithm also avoids duplication of gestures.
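The pipeline the abstract describes — a segmentation mask isolating the foreground, fused with a monocular depth map to highlight the nearest (gesturing) region — can be sketched as below. This is an illustrative sketch, not the authors' code: the function name `separate_foreground`, the parameter `near_fraction`, and the input conventions are assumptions. In practice the mask would come from a ResNet-101-based segmenter and the depth map from MiDaS (which outputs inverse depth, i.e. larger values mean closer to the camera).

```python
import numpy as np

def separate_foreground(frame, fg_mask, depth_map, near_fraction=0.5):
    """Remove the background and flag foreground pixels lying in the
    nearest `near_fraction` of the foreground depth range.

    frame:     H x W x 3 uint8 RGB image
    fg_mask:   H x W bool array, True where the segmenter found the subject
    depth_map: H x W float array, larger values = closer (MiDaS convention)
    """
    # Remove the noisy backdrop: zero out every non-foreground pixel.
    fg_frame = frame * fg_mask[..., None].astype(frame.dtype)

    # Normalise depth to [0, 1] using the foreground pixels only.
    fg_depth = depth_map[fg_mask]
    lo, hi = fg_depth.min(), fg_depth.max()
    norm = (depth_map - lo) / (hi - lo + 1e-8)

    # Highlight the foreground pixels nearest the camera -- in a gesture
    # frame these are typically the hands.
    near_mask = fg_mask & (norm >= 1.0 - near_fraction)
    return fg_frame, near_mask
```

Thresholding normalised inverse depth rather than raw depth is what makes a bounding box unnecessary here: the hand region is selected by how close it is, not by where it sits in the frame.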
Data availability
The referred papers and data will be available on request.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shamalik, R., Koli, S. Effective and efficient approach for gesture detection in video through monocular RGB frames. Multimed Tools Appl 82, 17231–17242 (2023). https://doi.org/10.1007/s11042-022-14207-x