Abstract
Detecting gestures is a difficult task, especially in dynamic or noisy contexts. Many approaches rely on a bounding box, which restricts both the usable area of the frame and the user's freedom of movement. This paper proposes a novel method for gesture detection in real-time video that aims not only to simplify the process but also to extract useful and diverse information from the given gestures. The proposed approach uses a Residual Neural Network (ResNet-101) to achieve foreground-background separation (FBGS), while the MiDaS model performs depth estimation (DE) on monocular RGB frames of gestures, increasing precision and removing the need for a bounding box entirely. For comparative analysis, this hierarchical model is evaluated with and without a Graphics Processing Unit (GPU). In this real-time model, the GPU reduces processing time by 90% while simultaneously improving the accuracy of the final result. In the final frame, the noisy backdrop is removed, the gestures are enhanced, and the relative distance between the objects and gestures is highlighted. The proposed algorithm also avoids duplication of gestures.
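The pipeline the abstract describes — a segmentation mask isolating the foreground, fused with a monocular depth map to highlight the nearest (gesturing) region — can be sketched as below. This is an illustrative sketch, not the authors' code: the function name `separate_foreground`, the parameter `near_fraction`, and the input conventions are assumptions. In practice the mask would come from a ResNet-101-based segmenter and the depth map from MiDaS (which outputs inverse depth, i.e. larger values mean closer to the camera).

```python
import numpy as np

def separate_foreground(frame, fg_mask, depth_map, near_fraction=0.5):
    """Remove the background and flag foreground pixels lying in the
    nearest `near_fraction` of the foreground depth range.

    frame:     H x W x 3 uint8 RGB image
    fg_mask:   H x W bool array, True where the segmenter found the subject
    depth_map: H x W float array, larger values = closer (MiDaS convention)
    """
    # Remove the noisy backdrop: zero out every non-foreground pixel.
    fg_frame = frame * fg_mask[..., None].astype(frame.dtype)

    # Normalise depth to [0, 1] using the foreground pixels only.
    fg_depth = depth_map[fg_mask]
    lo, hi = fg_depth.min(), fg_depth.max()
    norm = (depth_map - lo) / (hi - lo + 1e-8)

    # Highlight the foreground pixels nearest the camera -- in a gesture
    # frame these are typically the hands.
    near_mask = fg_mask & (norm >= 1.0 - near_fraction)
    return fg_frame, near_mask
```

Thresholding normalised inverse depth rather than raw depth is what makes a bounding box unnecessary here: the hand region is selected by how close it is, not by where it sits in the frame.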
Data availability
The referred papers and data will be available on request.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shamalik, R., Koli, S. Effective and efficient approach for gesture detection in video through monocular RGB frames. Multimed Tools Appl 82, 17231–17242 (2023). https://doi.org/10.1007/s11042-022-14207-x