1 Introduction

Human Computer Interaction (HCI) has been a lively field of research because computerized machines have become pervasive in our society, influencing every aspect of our lives [1]. The gesture recognition methods and techniques reported in the literature still fall short in both accuracy and extensibility to broad sets of gestures. Human gestures are naturally expressed with the body and hands, ranging from the simple gestures used in everyday conversation to the more elaborate gestures used to remotely control applications.

Automated recognition of human actions has received considerable attention over the past decade. Efros et al. [2] compute the correlation between two spatiotemporal volumes: human actors are tracked in a video, and an action is defined by the optical flow within actor-centered bounding boxes over a time interval. Their main limitation is time consumption, since similarity is computed with a cross-correlation score and therefore every pair of corresponding frames in the training and testing video sequences must be examined. Song et al. [3] used flow vectors, which required little manual intervention. Space-time interest points have also been used to analyze the structure of local 3D patches in a video, as in [4], where a video sequence is represented by a bag of spatiotemporal features called video-words, obtained by quantizing the 3D interest points (cuboids) extracted from the videos; however, this approach relies only on the power of the individual local space-time descriptors. In [5], self-similarities of action sequences over time are explored to recognize human actions under view changes. A novel approach that uses depth and skeleton data was presented in [6], which introduced a new challenging human interaction dataset for multiplayer gaming containing depth and skeleton data.

2 The Proposed Model

Figure 1 presents the proposed model. Each phase will be described briefly in the following subsections.

Fig. 1. The proposed model diagram and its phases

2.1 Video Preprocessing Phase

A video is a sequence of still images called frames. In this phase, the video is decomposed into n frames so that each frame is processed separately.
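For concreteness, a minimal sketch of this phase with OpenCV follows; the input file name and the choice to hold all frames in memory are illustrative assumptions, not details from the paper.

```python
# Sketch of the video preprocessing phase: decompose a video into n frames.
import cv2

def video_to_frames(path):
    """Read a video file and return its frames as a list of BGR images."""
    capture = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:          # no more frames to read
            break
        frames.append(frame)
    capture.release()
    return frames

frames = video_to_frames("input.avi")   # hypothetical file name
print(f"Decomposed the video into {len(frames)} frames")
```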

2.2 Video Segmentation Phase

This phase aims to isolate the region of interest, which in our case is the moving arm or leg, from the background. A common approach, used especially when the background is stationary and the camera is static, is background subtraction, which identifies the regions of a video frame that differ from a background model [7]. In our proposed model, frame differencing works as follows. The first frame is taken as a reference so that the moving region can be detected and segmented by differencing the first frame with each of the following n − 1 frames at time t. Mathematically speaking, each pixel of the first frame I, denoted P(I), is subtracted from the corresponding pixel at the same position in the following frames, as shown in Eq. 1.

$$ P\bigl(\mathrm{Diff}(t)\bigr) = P(I) - P\bigl(F(t)\bigr) $$
(1)

where F(t) is the following frame at time t, P(F(t)) is the pixel corresponding to P(I), Diff(t) is the frame produced by subtracting the following frame from the first frame, and P(Diff(t)) is the pixel corresponding to the difference of P(I) and P(F(t)) at time t.

The difference between the two frames, Diff(t), contains only the moving region, whether it is a leg or an arm, as shown in Fig. 2. A threshold is then needed to improve the subtraction of the two frames; in other words, the threshold is used to remove unwanted noise.

Fig. 2. Diff(t): the difference between two frames

$$ \left| P(I) - P\bigl(F(t)\bigr) \right| > \mathrm{Threshold} $$
(2)

An automatic global threshold based on Otsu's method [8], which operates on the gray-level histogram, was adopted and gave very good results in the segmentation phase. In other words, Otsu's algorithm performs clustering-based image thresholding. In some cases one threshold is not enough, because subjects might unintentionally move their heads or clothes while performing an action. Therefore, a second filtering step is applied that removes small connected white objects from the binary image.
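The segmentation phase can be sketched as follows; the minimum object area used to discard small connected white objects is an illustrative parameter, not a value reported in the paper.

```python
# Sketch of the segmentation phase: frame differencing against the reference
# frame, Otsu's global threshold, and removal of small connected white objects.
import cv2
import numpy as np

def segment_motion(reference, frame, min_area=150):   # min_area is assumed
    ref_gray = cv2.cvtColor(reference, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # |P(I) - P(F(t))| as in Eqs. 1 and 2
    diff = cv2.absdiff(ref_gray, cur_gray)

    # Automatic global threshold using Otsu's method to remove noise
    _, binary = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Second filtering step: drop small connected white objects
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    cleaned = np.zeros_like(binary)
    for label in range(1, num):                        # label 0 is the background
        if stats[label, cv2.CC_STAT_AREA] >= min_area:
            cleaned[labels == label] = 255
    return cleaned
```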

2.3 Depth Segmentation Phase

For videos captured by a Kinect camera, depth segmentation is introduced to capture the user's hand gestures: the video stream is segmented into a sequence of depth images with a disparity map, and the moving region nearest to the Kinect camera is extracted, as shown in Fig. 3.

Fig. 3. Depth disparity map
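A minimal sketch of this phase is given below, assuming the depth image is a 16-bit array in millimetres with 0 marking invalid pixels; the 200 mm tolerance band around the nearest point is an illustrative assumption.

```python
# Sketch of the depth segmentation phase: keep only the moving region
# nearest to the Kinect camera.
import numpy as np

def nearest_region_mask(depth_mm, band=200):          # band (mm) is assumed
    valid = depth_mm > 0                               # 0 = invalid depth pixel
    if not valid.any():
        return np.zeros(depth_mm.shape, dtype=np.uint8)
    nearest = depth_mm[valid].min()
    # Pixels within `band` millimetres of the nearest point form the mask
    mask = valid & (depth_mm <= nearest + band)
    return mask.astype(np.uint8) * 255
```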

2.4 Detection by Viola-Jones

This phase is considered the most important part of activity recognition, since the following phases depend on its accuracy. The Viola-Jones algorithm [9] is used for face and body detection. Despite the algorithm's efficiency in selecting features, it is sensitive to lighting and distance conditions, so an additional layer of detection must be used.

The first layer of detection is face detection. However, if the subject is far away or the resolution of the video is low, it is hard for the algorithm to detect the face and its features, so a second layer, upper-body detection, is added. It uses Haar features to encode the details of the head and shoulder area. The output of the detection phase is one of the cases shown in Table 1 and illustrated in Fig. 4. In the first and second cases, the vertices of the face are passed to the tracking phase; even if multiple bodies are detected in the frame but only one face is detected, the detected bodies are ignored and the face vertices are taken. In the third case, the vertices of the upper body are passed to the tracking phase. The main problem lies in the fourth and fifth cases, which confuse the system; this is resolved in the tracking phase.

Table 1. The output of detection phase
Fig. 4. Example of each case described in Table 1
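The two-layer detection just described can be sketched with OpenCV's stock Viola-Jones cascades; the cascade files and detectMultiScale parameters below are OpenCV defaults, not values taken from the paper.

```python
# Sketch of the detection phase: try face detection first, then fall back
# to upper-body detection (cases 1-3); cases 4 and 5 return no box.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
body_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_upperbody.xml")

def detect(frame):
    """Return ('face'|'body'|'none', list of (x, y, w, h) boxes)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:                  # cases 1 and 2: face vertices are used
        return "face", list(faces)
    bodies = body_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(bodies) > 0:                 # case 3: upper-body vertices are used
        return "body", list(bodies)
    return "none", []                   # cases 4 and 5: resolved in tracking
```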

2.5 Tracking Using Clusters of Directed Feature Points Phase

The tracking phase has two inputs extracted from the previous phases: the frames that encode the difference between the reference frame and the following frames (from the segmentation phase), and the output of the detection phase according to the case the video falls in (from the detection phase). In the normal cases 1, 2 and 3, a top-down approach is applied to the difference frames, in which the first and last white pixels read from each frame are subtracted from the center points, as shown in Fig. 5.

Fig. 5. Top-down approach

The center points are placed at the center of the body using the coordinates provided by the detection phase. The data passed by the detection stage, whether a face or an upper body was detected, are the x and y coordinates and the height and width of the detected area. The center point lies in the middle of the upper third of the body, as shown in Fig. 6. After the center points are calculated, the first and last white pixels obtained by the top-down scan of each segmented frame are subtracted from the center points to create a cloud of points of interest that captures and tracks the moving region, as shown in Fig. 7. In cases 4 and 5, another method is used: the center of the detected part is taken, and in those cases the center point is approximately the center of the moving white region. Using the top-down approach, the first pixel read from the right side is the one subtracted from the center points.

Fig. 6. Center point

Fig. 7. Cloud of points
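A sketch of the tracking step for the normal cases (1-3) follows; placing the center point at the middle of the upper third of the detected box is an interpretation of the description above rather than an exact specification from the paper.

```python
# Sketch of the tracking phase: compute the center point from the detected
# box and express the first/last white pixel of each segmented frame
# relative to it (center minus pixel).
import numpy as np

def center_point(box):
    x, y, w, h = box
    return (x + w // 2, y + h // 6)     # middle of the upper 1/3 of the box

def track_frame(binary_mask, center):
    """Return (x1, y1, x2, y2): center minus first/last white pixel."""
    ys, xs = np.nonzero(binary_mask)    # row-major: a top-down, left-right scan
    if len(ys) == 0:
        return None                     # nothing moved in this frame
    cx, cy = center
    return (cx - xs[0],  cy - ys[0],    # first white pixel read
            cx - xs[-1], cy - ys[-1])   # last white pixel read
```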

2.6 Classification Phase

The output of the tracking phase is four arrays: the x and y arrays (denoted x1 and y1) produced by subtracting the first pixel of each frame from the center points, and a second pair of x and y arrays (denoted x2 and y2) produced by subtracting the last pixel of each frame from the center points. For example, for a top left-hand wave, after the first pixel of each segmented frame is subtracted from the center points, the x1 array is filled with negative numbers and the y1 array with positive numbers; after the last pixel of each frame is subtracted from the center points, the x2 array is filled with negative numbers and the y2 array with positive numbers. In other words, a top left-hand wave yields four arrays filled with positive and negative numbers. In cases 4 and 5, since the center points themselves are used, the second pair of arrays (x2 and y2) is filled with zeros. Using these arrays, the moving region is classified into one of 9 different actions. Our proposed model was evaluated on a new dataset, shown in Table 2, that we created to contain all the required actions.

Table 2. Newly generated dataset
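The paper does not spell out how the four variable-length arrays are turned into a fixed-length classifier input, so the summary below (mean and sign statistics per array) is purely an illustrative assumption.

```python
# Sketch of feature construction from the tracking output (x1, y1, x2, y2).
import numpy as np

def feature_vector(x1, y1, x2, y2):
    """Summarize the four tracking arrays into a fixed-length feature vector."""
    features = []
    for arr in (x1, y1, x2, y2):
        arr = np.asarray(arr, dtype=float)
        if arr.size == 0:
            features.extend([0.0, 0.0, 0.0])
        else:
            features.extend([arr.mean(),        # average signed displacement
                             (arr > 0).mean(),  # fraction of positive entries
                             (arr < 0).mean()]) # fraction of negative entries
    # In cases 4 and 5, x2 and y2 are all zeros, which these statistics preserve.
    return np.array(features)
```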

3 Results and Discussion

The proposed model was evaluated on a newly generated dataset containing all the needed activities. For comparison purposes, our approach was also tested on multiple public datasets, and its accuracy was compared with previous research.

3.1 Newly Generated Dataset

As mentioned earlier, a new dataset was generated containing 15 subjects performing 9 different actions: full right-hand wave, full left-hand wave, full both-hands wave, top right-hand wave, top left-hand wave, top both-hands wave, right leg, left leg and clapping, as illustrated in Table 2.

In Table 3, a confusion matrix shows the results of our approach for the following activities: top and full right-hand wave, top and full left-hand wave, top and full right- and left-hand waves, left leg, right leg and clapping. The experiments were carried out with a Bayes Network classifier and ten-fold cross validation, and an accuracy of 95.3214% was achieved.

Table 3. Confusion matrix
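The evaluation protocol can be sketched as below; scikit-learn's GaussianNB is used as a convenient stand-in for the Bayes Network classifier, and the random data only stands in for the real feature vectors and labels of the dataset in Table 2.

```python
# Sketch of ten-fold cross validation with a Bayes-family classifier.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix

# X: one feature_vector() per video, y: its action label (0..8).
rng = np.random.default_rng(0)
X = rng.normal(size=(135, 12))          # 15 subjects x 9 actions, 12 features
y = np.repeat(np.arange(9), 15)

y_pred = cross_val_predict(GaussianNB(), X, y, cv=10)
print("Ten-fold CV accuracy:", accuracy_score(y, y_pred))
print(confusion_matrix(y, y_pred))
```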

3.2 IXMAS Dataset

INRIA Xmas Motion Acquisition Sequences (IXMAS) is a multi-view dataset for view-invariant human action recognition. It contains 13 daily-life motions, each performed three times by 11 actors. On this dataset, our approach outperforms previous work, as shown in Table 4.

Table 4. Classification results comparing the proposed approach with previous work on the IXMAS dataset
$$ {\text{Accuracy}} = \frac{TP + TN}{TP + TN + FP + FN} $$

where TP = True Positive, TN = True Negative, FP = False Positive, and FN = False Negative.
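As a quick check of the formula, with made-up counts:

```python
# Accuracy = (TP + TN) / (TP + TN + FP + FN); the counts are illustrative only.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=90, tn=85, fp=10, fn=15))   # -> 0.875
```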

3.3 G3D Dataset

The performance of our approach is also evaluated on a publicly available dataset designed specifically for real-time action recognition: G3D. In Table 5, the confusion matrix shows the results of our approach applied to the depth G3D dataset. The experiments were carried out with a Bayes Network classifier and ten-fold cross validation.

Table 5. Confusion matrix of the G3D dataset with the Bayes Network classifier

The G3D dataset was also evaluated with multiple classifiers using our proposed model, and it achieved outstanding results with every classifier, as shown in Table 6.

Table 6. Classification of the G3D dataset compared to previous studies

4 Conclusions

Most complete hand-interactive systems can be considered to comprise three layers: detection, tracking, and recognition. In our proposed model, a background subtraction method is integrated with the Viola-Jones algorithm for detection, followed by a clusters-of-directed-interest-points tracking method that tracks the motion and direction of the defined region. A new dataset composed of 15 different subjects was created and tested on multiple classifiers. Moreover, our proposed model was evaluated on large benchmark datasets such as Weizmann, IXMAS and G3D, yielding an impressive average improvement of 10% in terms of accuracy.