1 Introduction

Human Computer Interaction (HCI) has been a lively field of research because computerized machines have become pervasive in our society, influencing every aspect of our lives [1]. The gesture recognition methods and techniques reported in the literature still fall short in both accuracy and extensibility to broad sets of gestures. Human gestures are naturally expressed with the body and hands, ranging from the simple gestures used in everyday conversation to the more elaborate gestures used to remotely control applications.

Automated recognition of human actions has received considerable attention over the past decade. Efros et al. [2] compute the correlation between two spatiotemporal volumes: human actors are tracked in a video, and an action is defined by the optical flow within actor-centered bounding boxes over a time interval. Their main limitation is time consumption, since similarity is computed with a cross-correlation score and therefore every pair of corresponding frames in the training and testing video sequences must be examined. Song et al. [3] used flow vectors, which required little manual intervention. Space-time interest points have also been used to analyze the structure of local 3D patches in a video, as in [4], where a video sequence is represented by a bag of spatiotemporal features called video-words, obtained by quantizing the 3D interest points (cuboids) extracted from the videos; however, this approach relies only on the power of the individual local space-time descriptors. In [5], self-similarities of action sequences over time are explored to recognize human actions under view changes. A novel approach that uses depth and skeleton data was presented in [6], which introduced a new challenging human interaction dataset for multiplayer gaming containing depth and skeleton data.

2 The Proposed Model

Figure 1 presents the proposed model. Each phase will be described briefly in the following subsections.

Fig. 1. The proposed model diagram and its phases

2.1 Video Preprocessing Phase

A video is a sequence of still images called frames. In this phase, the video is decomposed into n frames so that each frame is processed separately.
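For concreteness, a minimal sketch of this phase with OpenCV follows; the input file name and the choice to hold all frames in memory are illustrative assumptions, not details from the paper.

```python
# Sketch of the video preprocessing phase: decompose a video into n frames.
import cv2

def video_to_frames(path):
    """Read a video file and return its frames as a list of BGR images."""
    capture = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:          # no more frames to read
            break
        frames.append(frame)
    capture.release()
    return frames

frames = video_to_frames("input.avi")   # hypothetical file name
print(f"Decomposed the video into {len(frames)} frames")
```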

2.2 Video Segmentation Phase

This phase aims to isolate the region of interest, which in our case is the moving arm or leg, from the background. A common approach, used especially when the background is stationary and the camera is static, is background subtraction, which identifies the regions of a video frame that differ from a background model [7]. In our proposed model, frame differencing works as follows. The first frame is taken as a reference so that the moving region can be detected and segmented by differencing the first frame with each of the following n − 1 frames at time t. Mathematically speaking, each pixel of the first frame I, denoted P(I), is subtracted from the corresponding pixel at the same position in the following frames, as shown in Eq. 1.

$$ P\bigl(\mathrm{Diff}(t)\bigr) = P(I) - P\bigl(F(t)\bigr) $$
(1)

where F(t) is the following frame at time t, P(F(t)) is the pixel corresponding to P(I), Diff(t) is the frame produced by subtracting the following frame from the first frame, and P(Diff(t)) is the pixel corresponding to the difference of P(I) and P(F(t)) at time t.

The difference between the two frames, Diff(t), contains only the moving region, whether it is a leg or an arm, as shown in Fig. 2. A threshold is then needed to improve the subtraction of the two frames; in other words, the threshold is used to remove unwanted noise.

Fig. 2. Diff(t): the difference between two frames

$$ \left| P(I) - P\bigl(F(t)\bigr) \right| > \mathrm{Threshold} $$
(2)

An automatic global threshold based on Otsu's method [8], which operates on the gray-level histogram, was adopted and gave very good results in the segmentation phase. In other words, Otsu's algorithm performs clustering-based image thresholding. In some cases one threshold is not enough, because subjects might unintentionally move their heads or clothes while performing an action. Therefore, a second filtering step is applied that removes small connected white objects from the binary image.
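The segmentation phase can be sketched as follows; the minimum object area used to discard small connected white objects is an illustrative parameter, not a value reported in the paper.

```python
# Sketch of the segmentation phase: frame differencing against the reference
# frame, Otsu's global threshold, and removal of small connected white objects.
import cv2
import numpy as np

def segment_motion(reference, frame, min_area=150):   # min_area is assumed
    ref_gray = cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # |P(I) - P(F(t))| as in Eqs. 1 and 2
    diff = cv2.absdiff(ref_gray, cur_gray)

    # Automatic global threshold using Otsu's method to remove noise
    _, binary = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Second filtering step: drop small connected white objects
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    cleaned = np.zeros_like(binary)
    for label in range(1, num):                        # label 0 is the background
        if stats[label, cv2.CC_STAT_AREA] >= min_area:
            cleaned[labels == label] = 255
    return cleaned
```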

2.3 Depth Segmentation Phase

For videos captured by a Kinect camera, depth segmentation is introduced to capture the user's hand gestures: the video stream is segmented into a sequence of depth images with a disparity map, and the moving region nearest to the Kinect camera is extracted, as shown in Fig. 3.

Fig. 3. Depth disparity map
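A minimal sketch of this phase is given below, assuming the depth image is a 16-bit array in millimetres with 0 marking invalid pixels; the 200 mm tolerance band around the nearest point is an illustrative assumption.

```python
# Sketch of the depth segmentation phase: keep only the moving region
# nearest to the Kinect camera.
import numpy as np

def nearest_region_mask(depth_mm, band=200):          # band (mm) is assumed
    valid = depth_mm > 0                               # 0 = invalid depth pixel
    if not valid.any():
        return np.zeros(depth_mm.shape, dtype=np.uint8)
    nearest = depth_mm[valid].min()
    # Pixels within `band` millimetres of the nearest point form the mask
    mask = valid & (depth_mm <= nearest + band)
    return mask.astype(np.uint8) * 255
```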

2.4 Detection by Viola-Jones

This phase is considered the most important part of activity recognition, since the following phases depend on its accuracy. The Viola-Jones algorithm [9] is used for face and body detection. Despite the algorithm's efficiency in selecting features, it is sensitive to lighting and distance conditions, so an additional layer of detection must be used.

The first layer of detection is face detection. However, if the subject is far away or the resolution of the video is low, it is hard for the algorithm to detect the face and its features, so a second layer, upper-body detection, is added. It uses Haar features to encode the details of the head and shoulder area. The output of the detection phase is one of the cases shown in Table 1 and illustrated in Fig. 4. In the first and second cases, the vertices of the face are passed to the tracking phase; even if multiple bodies are detected in the frame but only one face is detected, the detected bodies are ignored and the face vertices are taken. In the third case, the vertices of the upper body are passed to the tracking phase. The main problem lies in the fourth and fifth cases, which confuse the system; this is resolved in the tracking phase.

Table 1. The output of detection phase
Fig. 4. Example of each case described in Table 1
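The two-layer detection just described can be sketched with OpenCV's stock Viola-Jones cascades; the cascade files and detectMultiScale parameters below are OpenCV defaults, not values taken from the paper.

```python
# Sketch of the detection phase: try face detection first, then fall back
# to upper-body detection (cases 1-3); cases 4 and 5 return no box.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
body_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_upperbody.xml")

def detect(frame):
    """Return ('face'|'body'|'none', list of (x, y, w, h) boxes)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:                  # cases 1 and 2: face vertices are used
        return "face", list(faces)
    bodies = body_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(bodies) > 0:                 # case 3: upper-body vertices are used
        return "body", list(bodies)
    return "none", []                   # cases 4 and 5: resolved in tracking
```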

2.5 Tracking Using Clusters of Directed Feature Points Phase

The tracking phase has two inputs extracted from the previous phases: the frames that encode the difference between the reference frame and the following frames (from the segmentation phase), and the output of the detection phase according to the case the video falls in (from the detection phase). In the normal cases 1, 2 and 3, a top-down approach is applied to the difference frames, in which the first and last white pixels read from each frame are subtracted from the center points, as shown in Fig. 5.

Fig. 5. Top-down approach

The center points are placed at the center of the body using the coordinates provided by the detection phase. The data passed by the detection stage, whether a face or an upper body was detected, are the x and y coordinates and the height and width of the detected area. The center point lies in the middle of the upper third of the body, as shown in Fig. 6. After the center points are calculated, the first and last white pixels obtained by the top-down scan of each segmented frame are subtracted from the center points to create a cloud of points of interest that captures and tracks the moving region, as shown in Fig. 7. In cases 4 and 5, another method is used: the center of the detected part is taken, and in those cases the center point is approximately the center of the moving white region. Using the top-down approach, the first pixel read from the right side is the one subtracted from the center points.

Fig. 6. Center point

Fig. 7. Cloud of points
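A sketch of the tracking step for the normal cases (1-3) follows; placing the center point at the middle of the upper third of the detected box is an interpretation of the description above rather than an exact specification from the paper.

```python
# Sketch of the tracking phase: compute the center point from the detected
# box and express the first/last white pixel of each segmented frame
# relative to it (center minus pixel).
import numpy as np

def center_point(box):
    x, y, w, h = box
    return (x + w // 2, y + h // 6)     # middle of the upper 1/3 of the box

def track_frame(binary_mask, center):
    """Return (x1, y1, x2, y2): center minus first/last white pixel."""
    ys, xs = np.nonzero(binary_mask)    # row-major: a top-down, left-right scan
    if len(ys) == 0:
        return None                     # nothing moved in this frame
    cx, cy = center
    return (cx - xs[0],  cy - ys[0],    # first white pixel read
            cx - xs[-1], cy - ys[-1])   # last white pixel read
```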

2.6 Classification Phase

The output of the tracking phase is four arrays: the x and y arrays (denoted x1 and y1) produced by subtracting the first pixel of each frame from the center points, and a second pair of x and y arrays (denoted x2 and y2) produced by subtracting the last pixel of each frame from the center points. For example, for a top left-hand wave, after the first pixel of each segmented frame is subtracted from the center points, the x1 array is filled with negative numbers and the y1 array with positive numbers; after the last pixel of each frame is subtracted from the center points, the x2 array is filled with negative numbers and the y2 array with positive numbers. In other words, a top left-hand wave yields four arrays filled with positive and negative numbers. In cases 4 and 5, since the center points themselves are used, the second pair of arrays (x2 and y2) is filled with zeros. Using these arrays, the moving region is classified into one of 9 different actions. Our proposed model was evaluated on a new dataset, shown in Table 2, that we created to contain all the required actions.

Table 2. Newly generated dataset
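The paper does not spell out how the four variable-length arrays are turned into a fixed-length classifier input, so the summary below (mean and sign statistics per array) is purely an illustrative assumption.

```python
# Sketch of feature construction from the tracking output (x1, y1, x2, y2).
import numpy as np

def feature_vector(x1, y1, x2, y2):
    """Summarize the four tracking arrays into a fixed-length feature vector."""
    features = []
    for arr in (x1, y1, x2, y2):
        arr = np.asarray(arr, dtype=float)
        if arr.size == 0:
            features.extend([0.0, 0.0, 0.0])
        else:
            features.extend([arr.mean(),        # average signed displacement
                             (arr > 0).mean(),  # fraction of positive entries
                             (arr < 0).mean()]) # fraction of negative entries
    # In cases 4 and 5, x2 and y2 are all zeros, which these statistics preserve.
    return np.array(features)
```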

3 Results and Discussion

The proposed model was evaluated on a newly generated dataset containing all the needed activities. For comparison purposes, our approach was also tested on multiple public datasets, and its accuracy was compared with previous research.

3.1 Newly Generated Dataset

As mentioned earlier, a new dataset was generated containing 15 subjects performing 9 different actions: full right-hand wave, full left-hand wave, full both-hands wave, top right-hand wave, top left-hand wave, top both-hands wave, right leg, left leg and clapping, as illustrated in Table 2.

In Table 3, a confusion matrix shows the results of our approach for the following activities: top and full right-hand wave, top and full left-hand wave, top and full right- and left-hand waves, left leg, right leg and clapping. The experiments were carried out with a Bayes Network classifier and ten-fold cross validation, and an accuracy of 95.3214% was achieved.

Table 3. Confusion matrix
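The evaluation protocol can be sketched as below; scikit-learn's GaussianNB is used as a convenient stand-in for the Bayes Network classifier, and the random data only stands in for the real feature vectors and labels of the dataset in Table 2.

```python
# Sketch of ten-fold cross validation with a Bayes-family classifier.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix

# X: one feature_vector() per video, y: its action label (0..8).
rng = np.random.default_rng(0)
X = rng.normal(size=(135, 12))          # 15 subjects x 9 actions, 12 features
y = np.repeat(np.arange(9), 15)

y_pred = cross_val_predict(GaussianNB(), X, y, cv=10)
print("Ten-fold CV accuracy:", accuracy_score(y, y_pred))
print(confusion_matrix(y, y_pred))
```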

3.2 IXMAS Dataset

INRIA Xmas Motion Acquisition Sequences (IXMAS) is a multi-view dataset for view-invariant human action recognition. It contains 13 daily-life motions, each performed three times by 11 actors. On this dataset, our approach outperforms previous work, as shown in Table 4.

Table 4. Classification results comparing the proposed approach with previous work on the IXMAS dataset
$$ {\text{Accuracy}} = \frac{TP + TN}{TP + TN + FP + FN} $$

where TP = True Positive, TN = True Negative, FP = False Positive, and FN = False Negative.
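As a quick check of the formula, with made-up counts:

```python
# Accuracy = (TP + TN) / (TP + TN + FP + FN); the counts are illustrative only.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=90, tn=85, fp=10, fn=15))   # -> 0.875
```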

3.3 G3D Dataset

The performance of our approach is also evaluated on a publicly available dataset designed specifically for real-time action recognition: G3D. In Table 5, the confusion matrix shows the results of our approach applied to the depth G3D dataset. The experiments were carried out with a Bayes Network classifier and ten-fold cross validation.

Table 5. Confusion matrix of the G3D dataset with the Bayes Network classifier

The G3D dataset was also evaluated with multiple classifiers using our proposed model, and it achieved outstanding results with every classifier, as shown in Table 6.

Table 6. Classification of the G3D dataset compared to previous studies

4 Conclusions

Most complete hand-interactive systems can be considered to comprise three layers: detection, tracking, and recognition. In our proposed model, a background subtraction method is integrated with the Viola-Jones algorithm for detection, followed by a clusters-of-directed-interest-points tracking method that tracks the motion and direction of the defined region. A new dataset composed of 15 different subjects was created and tested on multiple classifiers. Moreover, our proposed model was evaluated on large benchmark datasets such as Weizmann, IXMAS and G3D, yielding an impressive average improvement of 10% in terms of accuracy.