1 Introduction

Accurate and effective human recognition is a major research area in computer vision, pattern recognition, biometrics, and intelligent surveillance. Face recognition [1] and fingerprint recognition [2] were developed earlier and are mature, but these biometrics require the subject's cooperation and a high image resolution. Gait, in contrast, can be captured at a distance without the subject's cooperation, and gait biometrics have already been studied as a forensic tool [18].

Existing gait recognition methods can be roughly divided into two main groups: model-free and model-based. Model-free methods include a variety of silhouette-based approaches. Phillips et al. [17] proposed a baseline algorithm using the correlation of silhouettes. The Gait Energy Image (GEI) [3] uses the average of the silhouettes over a gait period to characterize gait; the Gait Entropy Image (GEnI) [4] and the Chrono-Gait Image (CGI) [5] are similar to GEI [3]. Model-free methods are simple and of low computational complexity, but perform poorly under viewpoint changes and occlusion. Model-based methods model the structure of the human body using body structure parameters; early examples include the pendulum model [6] and the stick model [7].

Gait features can be divided into two categories: dynamic and static. With the exception of Phillips' method [17], silhouette-based methods employ both dynamic and static features. GEI [3] encodes more static characteristics, while GEnI [4] and CGI [5] encode more dynamic ones. Many previous works argue that, although motion reflects the essential nature of gait, recognition performance based purely on motion features is limited [9, 15, 16]. However, is the distinguishing ability of motion itself limited, or do existing features fail to adequately capture its characteristics? This paper explores this question in the context of Kinect-based gait recognition.

Kinect's skeleton tracking provides real-time 3D coordinates of 20 human skeleton joints, eliminating the need for complex human-model extraction procedures. Kinect-based gait features can likewise be divided into static and dynamic ones. Static features mainly refer to skeleton segment lengths, height, and so on. For example, Araujo et al. [13] extracted a total of 11 static features, including segment lengths and height.

Dynamic features can be subdivided into four categories: (i) intuitive dynamic features: step length, speed, and gait cycle; Preis et al. [8] extracted step length and speed as dynamic gait features; (ii) angle-based dynamic features: statistical characteristics of bone angles; [9, 10] extracted statistical characteristics of lower-body angles as dynamic features; (iii) absolute motion features: statistical characteristics of the changes in skeleton points' absolute coordinates; the Vertical Distance Features (VDF) [11] describe the statistical characteristics of the absolute coordinate changes of selected joints during walking; (iv) relative motion features: statistical characteristics extracted from the relative distances between joints, which avoids the sensitivity of absolute motion features to sensor placement and walking direction.

Chattopadhyay et al. [12] used the average relative distance between selected joints as dynamic gait features and achieved poor performance. However, the mean value alone is not enough: the standard deviation also carries substantial dynamic information about walking, and combining the two leads to better results.

This paper extracts relative motion features as dynamic gait features. The proposed robust relative motion features achieve a recognition accuracy of up to 85 % and do not require gait-cycle estimation. When combined with static features, recognition accuracy exceeds 95 %.

The rest of the paper is organized as follows. The proposed method is presented in Sect. 2, the experiments and results are shown in Sect. 3, and the conclusion is drawn in Sect. 4.

2 Proposed Method

This section describes the proposed method in four parts: the Kinect skeleton data stream is introduced first, followed by the extraction of dynamic and static gait features, and finally feature selection and classification.

2.1 Kinect Skeleton Data

This paper uses only the skeleton data stream. Kinect v1.0 provides 3D coordinates of 20 joints for up to two human bodies at 30 fps, as shown in Fig. 1(a). The coordinate system is shown in Fig. 1(b): the origin of the Cartesian coordinate system is the center of the depth sensor; the x-axis is parallel to the Kinect; the y-axis is perpendicular to the bottom surface of the Kinect; the z-axis is parallel to the sensor's normal direction. The units are meters. This paper extracts dynamic and static gait features from the Kinect skeleton data stream for person identification.

Fig. 1.
figure 1

(a) 20 joints of skeleton; (b) Kinect depth sensor coordinate system.

2.2 Distance-Based Relative Motion Features Extraction

In practical application scenarios, people will not walk along a preset direction as in the experimental environment, and the Kinect sensor may move or rotate to obtain a larger view. Therefore, the absolute coordinates of human joints are not suitable for characterizing gait in real scenes. All dynamic features are instead extracted from the relative distances between particular joints. The relative distances are divided into three groups by direction: the x, y, and z directions, which coincide with the axes of the coordinate system above. The 11 relative distances are given in formula (1),

$$ \begin{array}{l} Dx1 = abs(x(17) - x(18));\hfill\\ Dx2 = abs(x(5) - x(6));\hfill\\ Dx3 = abs(x(9) - x(10));\hfill\\ Dx4 = abs(x(1) - (x(17) + x(18))/2);\hfill\\ Dx5 = abs(x(11) - (x(17) + x(18))/2);\hfill\\ Dx6 = abs(x(7) - x(8));\hfill\\ Dx7 = abs(x(3) - x(4));\hfill\\ Dy1 = abs(y(1) - (y(19) + y(20))/2);\hfill\\ Dy2 = abs(y(1) - (y(15) + y(16))/2);\hfill\\ Dy3 = abs(y(19) - y(20));\hfill\\ Dz1 = abs(z(9) - z(10));\hfill\\\end{array} $$
(1)

where Dxi (or Dyj, Dzk) stands for a distance in the x-axis (or y-axis, z-axis) direction, x(i) (or y(j), z(k)) stands for the ith (or jth, kth) joint's x (or y, z) coordinate, and abs(·) stands for the absolute value. For example, Dx1 stands for the distance between the two ankles in the x direction.
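As a concrete illustration, the 11 distances of formula (1) can be computed per frame as follows. This is only a sketch: it assumes the joint coordinates of one frame are stored in a (20, 3) array ordered by the 1-based joint numbering of Fig. 1(a), which must be matched to the actual Kinect SDK skeleton layout.

```python
import numpy as np

def relative_distances(joints):
    """joints: (20, 3) array of (x, y, z) coordinates for one frame,
    rows ordered by the paper's 1-based joint numbering (Fig. 1(a)).
    Returns the 11 relative distances Dx1..Dx7, Dy1..Dy3, Dz1 of formula (1)."""
    x, y, z = joints[:, 0], joints[:, 1], joints[:, 2]
    j = lambda i: i - 1  # convert the paper's 1-based indices to 0-based
    return np.array([
        abs(x[j(17)] - x[j(18)]),                   # Dx1: ankles
        abs(x[j(5)]  - x[j(6)]),                    # Dx2
        abs(x[j(9)]  - x[j(10)]),                   # Dx3
        abs(x[j(1)]  - (x[j(17)] + x[j(18)]) / 2),  # Dx4
        abs(x[j(11)] - (x[j(17)] + x[j(18)]) / 2),  # Dx5
        abs(x[j(7)]  - x[j(8)]),                    # Dx6
        abs(x[j(3)]  - x[j(4)]),                    # Dx7
        abs(y[j(1)]  - (y[j(19)] + y[j(20)]) / 2),  # Dy1
        abs(y[j(1)]  - (y[j(15)] + y[j(16)]) / 2),  # Dy2
        abs(y[j(19)] - y[j(20)]),                   # Dy3
        abs(z[j(9)]  - z[j(10)]),                   # Dz1
    ])
```

Applying this function to every frame of a sequence yields the per-frame distance matrix from which the statistics in formula (2) are computed.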

The relative motion features in the x (or y, z) direction are the statistical characteristics (i.e., mean and standard deviation) of the relative distances of particular joints in that direction, as shown in formula (2),

$$ \begin{array}{l} STD = std\{ Dx1,Dx2,Dx3,Dx4,Dx5,Dx6,Dx7,Dy1,Dy2,Dy3\};\hfill\\ MEAN = mean\{ Dx1,Dx2,Dx3,Dx4,Dx5,Dx6,Dy1,Dy2,Dy3,Dz1\};\hfill\\ Dynamic\;GF =\{STD,MEAN\};\hfill\\\end{array} $$
(2)

where std(·) is the standard deviation function and mean(·) is the mean value function. STD is the vector of standard deviations and MEAN is the vector of mean values. Together they form the relative motion feature vector Dynamic GF, whose length is 20.
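The aggregation of formula (2) can be sketched as follows, assuming a per-frame distance matrix whose columns are ordered Dx1..Dx7, Dy1..Dy3, Dz1 as in formula (1). Note that STD omits Dz1 and MEAN omits Dx7, matching formula (2); the column indices below encode that.

```python
import numpy as np

def dynamic_gf(D):
    """D: (n_frames, 11) matrix of per-frame relative distances,
    columns ordered Dx1..Dx7, Dy1..Dy3, Dz1 as in formula (1).
    Returns Dynamic GF = {STD, MEAN}, a vector of length 20."""
    std_cols  = list(range(10))                   # Dx1..Dx7, Dy1..Dy3 (no Dz1)
    mean_cols = [0, 1, 2, 3, 4, 5, 7, 8, 9, 10]   # Dx1..Dx6, Dy1..Dy3, Dz1 (no Dx7)
    STD  = D[:, std_cols].std(axis=0)             # per-column std over all frames
    MEAN = D[:, mean_cols].mean(axis=0)           # per-column mean over all frames
    return np.concatenate([STD, MEAN])
```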

2.3 Static Gait Feature Extraction

Similar to [9, 13], the proposed static gait features mainly refer to the lengths of different body parts. The lengths of all 19 segments (the bones in Fig. 1(a)) together with the height make up the static gait features, 20 in total. The height is defined as the sum of the neck length, the upper and lower spine lengths, and the average leg length.

Static gait features are calculated for each frame. The mean and standard deviation of each component are computed over all frames, outliers beyond two standard deviations from the mean are removed, and the mean of each component is then recalculated. These new means make up the final static gait feature vector, Static GF.
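The per-component outlier rejection described above can be sketched as follows, assuming the per-frame static features (19 segment lengths plus height) are stacked into a matrix:

```python
import numpy as np

def static_gf(lengths):
    """lengths: (n_frames, 20) matrix of per-frame static features
    (19 segment lengths + height). For each component, frames deviating
    more than two standard deviations from the mean are discarded, and
    the mean is recomputed over the remaining frames."""
    mu = lengths.mean(axis=0)
    sd = lengths.std(axis=0)
    out = []
    for k in range(lengths.shape[1]):
        col  = lengths[:, k]
        keep = np.abs(col - mu[k]) <= 2 * sd[k]   # drop outliers beyond 2 sigma
        out.append(col[keep].mean())
    return np.array(out)                           # final Static GF, length 20
```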

The proposed Dynamic GF and Static GF together make up the combined gait feature vector Combined GF, whose length is 40.

2.4 Feature Selection and Classification

There is considerable noise in the captured Kinect skeleton data stream, and which features are relevant is uncertain because the walking conditions are unknown, so a fixed feature selection is difficult. To tackle this problem, a classifier ensemble based on the Random Subspace Method (RSM) and Majority Voting (MV) is employed for feature selection and classification [14], with K-Nearest Neighbor (KNN) using the Manhattan distance as the base classifier. As shown in Fig. 2(a), the classification result is obtained by voting among L weak classifiers, where GF stands for the gait feature space and R for a random subspace of GF. To achieve a higher recognition rate, 10-fold cross-validation is used to select the best K for KNN. The purpose of employing RSM is to validate its effectiveness in this scenario, so parameter tuning is not the focus; parameters are set according to experience and may not be optimal.
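A minimal sketch of the RSM + MV ensemble with a 1-NN base classifier is given below. The ensemble size L and the subspace ratio are illustrative placeholders, since the paper does not report its exact settings:

```python
import numpy as np
from collections import Counter

def manhattan_1nn(train_X, train_y, x):
    """Classify a single sample x by its nearest training sample
    under the Manhattan (L1) distance."""
    d = np.abs(train_X - x).sum(axis=1)
    return train_y[np.argmin(d)]

def rsm_predict(train_X, train_y, x, L=20, subspace=0.5, seed=None):
    """Random Subspace Method with Majority Voting: each of the L weak
    classifiers is a 1-NN trained on a random subset of the features
    (a random subspace R of the gait feature space GF)."""
    rng = np.random.default_rng(seed)
    n_feat = train_X.shape[1]
    k = max(1, int(subspace * n_feat))
    votes = []
    for _ in range(L):
        idx = rng.choice(n_feat, size=k, replace=False)   # draw subspace R
        votes.append(manhattan_1nn(train_X[:, idx], train_y, x[idx]))
    return Counter(votes).most_common(1)[0][0]            # majority vote
```

Because each weak classifier sees only part of the feature space, noisy or irrelevant components affect only some of the votes, which is what makes RSM act as an implicit feature selector.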

Fig. 2.
figure 2

(a) Feature selection and classification; (b) Semi-circular path used for subjects’ walks, showing the Kinect sensor at the center equipped with dish to allow for tracking [9].

3 Experiment and Results

3.1 Skeleton Gait Dataset

The skeleton gait dataset used is the public dataset of Andersson et al. [9]. It includes 140 subjects, each with five sequences. During data capture, each volunteer was asked to walk in front of the Kinect along a semi-circular path, and a spinning dish helped the Kinect keep the subject centered in its view. As shown in Fig. 2(b), each subject performed five round-trip free-cadence walks, starting from the left, walking clockwise to the right, and then back. Each sequence contains about 500–600 frames of data.

3.2 Classification Results

Like most gait recognition algorithms, KNN is chosen as the classifier. To achieve the highest recognition rate, 10-fold cross-validation was used to select K: the dataset was randomly partitioned into 10 subsets, and training was performed 10 times, each time leaving one partition out of the training process for testing; the reported accuracies are the averages over these 10 executions. The search range of K was 1 to 70. The results showed that the accuracies of all three feature sets peaked at K = 1, so all subsequent experiments used K = 1 (i.e., 1-NN).
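The K-selection procedure above can be sketched as follows; this is an illustrative re-implementation (plain 10-fold cross-validation over a Manhattan-distance KNN), not the authors' code. The caller must keep k_max below the training-fold size.

```python
import numpy as np

def knn_predict(train_X, train_y, X, k):
    """K-NN with Manhattan distance and majority vote among the k neighbors."""
    preds = []
    for x in X:
        d = np.abs(train_X - x).sum(axis=1)
        nn = train_y[np.argsort(d)[:k]]
        vals, counts = np.unique(nn, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

def select_k(X, y, k_max=70, folds=10, seed=0):
    """Select K by 10-fold cross-validation: partition the data into
    `folds` subsets, train on folds-1 of them, test on the held-out
    subset, and average the accuracy over all folds."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), folds)
    best_k, best_acc = 1, -1.0
    for k in range(1, k_max + 1):
        accs = []
        for i in range(folds):
            test = parts[i]
            train = np.concatenate([parts[j] for j in range(folds) if j != i])
            pred = knn_predict(X[train], y[train], X[test], k)
            accs.append((pred == y[test]).mean())
        acc = float(np.mean(accs))
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```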

Table 1 shows the average recognition accuracies for the different feature sets. The relative motion features' recognition accuracy on 140 subjects is 84.6 %, comparable to that of the static features. When combined, the recognition accuracy reaches 95.4 %, a 9.3 % increase over the static features alone, suggesting good complementarity between the proposed relative motion features and the static ones.

Table 1. Average recognition accuracy under different feature sets.

To account for the effect of gallery size, the proposed method was evaluated under different gallery sizes. For each size P (P = 10, 20, …, 130), 10 subsets of that size were randomly drawn as galleries, and the reported accuracies are averages over these 10 subsets. The results in Fig. 3(a) show that the recognition accuracies of the three proposed feature sets are not very sensitive to changes in gallery size, especially for the combined gait features. These results across gallery sizes further validate that motion, as the essential nature of gait, has sufficient recognition capability.

Fig. 3.
figure 3

(a) Average recognition accuracy under different gallery sizes and different feature sets. Error bars represent one standard deviation; (b) VDF [11] and GA [9] and proposed relative motion features’ average recognition accuracy under different gallery sizes and different feature sets. Error bars represent one standard deviation.

When introducing the relative motion features of skeleton-based gait, it was argued that using only the mean of the relative distances is not enough and that the standard deviation also represents the motion well. To verify this, Dynamic GF's subsets STD and MEAN were used for recognition separately. The results are reported in Table 2. Although STD alone does not perform very well, combining it with MEAN improves performance greatly compared to MEAN alone, confirming the previous analysis.

Table 2. Average recognition accuracy under relative motion features and its subsets.

The proposed method uses RSM for feature selection on the proposed gait feature space. To quantify the improvement RSM brings, this paper also experimented without it. The results show that RSM brings accuracy improvements of 2.2 %, 0.8 %, and 1.0 % for the static, dynamic, and combined gait features, respectively.

3.3 Comparison to Other Methods

Except for the method proposed in [11], the proposed relative motion features outperform the other methods by a large margin (i.e., over 20 %). The recognition accuracies of the intuitive feature-based [8] and angle-based motion features [9, 10] are very low; the poor performance of the intuitive features is explained in Sect. 1. Figure 3(b) shows how the recognition accuracies of [9, 11] and the proposed method vary with gallery size. According to [9], the performance of the Gait Attributes (GA) decreases rapidly as the gallery size increases. VDF, developed in [11], belongs to the absolute motion features and achieved 83.5 % accuracy on their 20-subject dataset. In this paper, VDF was re-implemented and evaluated on the 140-subject dataset [9]. As shown in Fig. 3(b), when the gallery size is very small, such as 20, the implemented VDF's recognition accuracy is slightly higher than the proposed method's, but the difference is not significant; moreover, the implemented VDF's accuracy decreases more rapidly than that of the proposed relative motion features as the gallery size increases. In conclusion, the discrimination of the relative motion features is higher than that of the angle-based and absolute motion features.

In addition, when combined with the proposed static features, the recognition accuracy reaches 95.4 %.

4 Conclusion

This paper explored relative motion features for gait recognition with Kinect, proposing relative distance-based motion features. Experimental results showed that the recognition accuracy of the relative motion features is up to 85 %, comparable to that of the static features. When the motion and static features are used together, recognition accuracy exceeds 95 %.

These results suggest that motion, as the essential characteristic of gait, has significant recognition ability, and that relative motion features are an effective representation of gait, worthy of further study in non-Kinect scenes.