
1 Introduction

The analysis of human activity is a critical component of applications in fields such as health, security, and sports. Performing this task automatically is challenging and has prompted researchers to attempt a multitude of approaches [3, 5, 19]. Among the most common devices used for this task are depth cameras such as the Kinect®. Some approaches use the spatial coordinates of human body joints to compute feature vectors that can be used for classification. In [16], the authors characterize the joints in polar coordinates to achieve higher activity classification performance. Other methods use clustering and classification techniques (K-means, SVM) to generate a codebook of key postures and subsequently employ a Hidden Markov Model (HMM) to recognize the different combinations of postures and thus identify the activity being performed. However, all these methods are limited by partial occlusions of the target [4, 15].

Sensors such as Inertial Measurement Units (IMUs) are also used for activity recognition [2, 18]. However, these sensors require high processing capabilities [13], and a single sensor is usually not sufficient for satisfactory detection [1]. Electromyographic (EMG) sensors are also useful for activity recognition [6, 14], but they likewise require sophisticated signal processing and multiple sensors to reach adequate detection accuracy. Kang et al. use Mel-Frequency Cepstral Coefficients (MFCC), obtaining an activity recognition accuracy of 85% [11]. Korbinian et al. use an HMM for activity recognition and neural networks for motion segmentation, reporting high accuracy rates between 93% and 100% [12]. There is a consensus that fusing data from different sensors improves human activity recognition systems [5, 16]; moreover, a single sensor modality is generally not capable of identifying the wide range of human activities. Although several human activity recognition methods based on multi-modal fusion have been proposed [7, 9], few techniques take more than two sensing modalities into account at the same time. Among the recent works that do, Zhand et al. employ a model based on primitive motions that classifies movements using Bag of Features (BOF) techniques with histograms of primitive symbols [17]. To the best of our knowledge, no existing method fuses the information of IMUs, EMGs, and depth sensors simultaneously. We therefore propose a fusion method that combines the strengths of each sensor to provide better performance.

2 Proposed Method

This paper proposes an activity recognition method based on primitive motion detection. Our method comprises two main steps. First, we analyze the sensor data over a small time window to perform primitive motion classification, creating a motion sequence for each sensor. Second, this sequence of primitives is fed into a Hidden Markov Model that classifies the overall activity. An overview of the prediction and training methods is shown in Fig. 1. To validate our method, we built an annotated database containing 5 different human activities, each performed 3 times by 16 different individuals. For each subject, we captured raw data from 4 IMUs, 4 EMG sensors, and a Kinect® device. Our dataset is publicly available at https://goo.gl/6F82wd.

Fig. 1. Overview of the training and classification processes of our proposed approach.

2.1 Primitive Motions Recognition

Models based on primitive motions are inspired by techniques from human speech analysis [8]. In speech recognition, phrases are generally divided into isolated phonemes, and these phoneme models are used as basic blocks to build words and phrases hierarchically [10]. Our motion detection model follows an idea similar to that of Zhand et al. [17]: each activity is represented as a sequence of sub-movements, or primitive motions, generating a unique signature that is used to classify the overall activity.

Primitive Motions Encoding. In this work, we propose eight primitive motions to train the HMM system: (1) Repose, (2) Partially crouched, (3) Fully crouched, (4) In midair, (5) Quarter arm raise, (6) Three-quarter arm raise, (7) Step forward with right foot, and (8) Step forward with left foot.
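For concreteness, these primitives can be represented as integer class labels shared by the SVM and HMM stages. The Python sketch below shows one possible encoding; the names and label values are our own illustrative choices, not a scheme prescribed by the method.

from enum import IntEnum

# Hypothetical encoding of the eight primitive motions of Sect. 2.1.
# The integer values are illustrative assumptions, not the authors' scheme.
class Primitive(IntEnum):
    REPOSE = 1
    PARTIALLY_CROUCHED = 2
    FULLY_CROUCHED = 3
    IN_MIDAIR = 4
    QUARTER_ARM_RAISE = 5
    THREE_QUARTER_ARM_RAISE = 6
    STEP_RIGHT_FOOT = 7
    STEP_LEFT_FOOT = 8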

Feature Extraction. For each sensor modality, a set of features is extracted from the data stream over an observation window, which was set experimentally to 3 s. For the Kinect®, the descriptor vector is obtained from 14 human pivot points. The sensor provides data at 30 samples per second, but our feature vector is built from averages over groups of 3 samples, corresponding to an overall rate of 10 samples per second. Given the set of body joints in Cartesian coordinates, all points are converted to polar coordinates relative to the center of mass:

$$\begin{aligned} P_{i}=\left[ r_{1}\,\theta _{1}\,r_{2}\,\theta _{2} \ldots r_{14}\,\theta _{14}\right] , \end{aligned}$$
(1)

where i indexes the sample window, with \(i=\left\{ 1,2,3\right\} \). In addition, the mean m and standard deviation v are computed over all the coordinates. The final feature vector for the Kinect® sensor is then defined according to

$$\begin{aligned} \text{ KIT }=\left[ P_{1}\,P_{2}\,P_{3}\,m_{x}\,m_{y}\,m_{z}\,m_{r}\,m_{\theta }\,v_{x}\,v_{y}\,v_{z}\,v_{r}\,v_{\theta }\right] . \end{aligned}$$
(2)
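A minimal Python sketch of this descriptor follows. It assumes the polar conversion is performed in the frontal (x, y) plane on joint positions already expressed relative to the center of mass; the function name, the array layout, and the choice of plane are our assumptions, not details stated in the paper.

import numpy as np

def kinect_features(windows):
    # `windows` is assumed to be a (3, 14, 3) array: three averaged sample
    # windows, 14 joints, (x, y, z) relative to the center of mass.
    P, xyz = [], []
    for joints in windows:                               # one P_i per window
        x, y = joints[:, 0], joints[:, 1]
        r = np.hypot(x, y)                               # radial coordinate
        theta = np.arctan2(y, x)                         # angular coordinate
        P.append(np.ravel(np.column_stack((r, theta))))  # [r1 th1 ... r14 th14]
        xyz.append(joints)
    xyz = np.concatenate(xyz)                            # all joints, all windows
    r_all = np.hypot(xyz[:, 0], xyz[:, 1])
    th_all = np.arctan2(xyz[:, 1], xyz[:, 0])
    m = [xyz[:, 0].mean(), xyz[:, 1].mean(), xyz[:, 2].mean(),
         r_all.mean(), th_all.mean()]                    # m_x m_y m_z m_r m_theta
    v = [xyz[:, 0].std(), xyz[:, 1].std(), xyz[:, 2].std(),
         r_all.std(), th_all.std()]                      # v_x v_y v_z v_r v_theta
    return np.concatenate(P + [m, v])                    # KIT vector of Eq. (2)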

For the IMUs, 4 sensors were attached near the wrists and knees of the subjects. Each IMU provides 30 samples per second; as with the Kinect®, we average groups of 3 samples, so the IMU features are also produced at 10 samples per second. With the IMU data \(I_{k}=\left[ a_{x}\,a_{y}\,a_{z}\,a_{\theta }\,a_{\phi }\right] _{1\times 5}\), where \(k=\left\{ 1,2,3,4\right\} \) indexes the IMU, we compute two types of descriptors: (1) features based on the physical parameters of human motion [18], and (2) statistical descriptors. The overall IMU descriptor is the concatenation of the \(IMU_{k}\) descriptors of all the sensors in the network, i.e.,

$$\begin{aligned} IMUF=\left[ IMU_{1}\,IMU_{2}\,IMU_{3}\,IMU_{4}\right] . \end{aligned}$$
(3)
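The sketch below illustrates the statistical part of the per-sensor descriptor and the concatenation of Eq. (3). The physical-parameter features of [18] are not detailed in the text, so the difference-based term here is only a hedged placeholder.

import numpy as np

def imu_descriptor(samples):
    # `samples` is assumed to be a (3, 5) array of averaged readings
    # [a_x, a_y, a_z, a_theta, a_phi] within the observation window.
    return np.concatenate([
        samples.ravel(),                                # raw averaged readings
        samples.mean(axis=0),                           # per-channel mean
        samples.std(axis=0),                            # per-channel std. dev.
        np.abs(np.diff(samples, axis=0)).mean(axis=0),  # placeholder for the
                                                        # physical features [18]
    ])

def imu_network_descriptor(imus):
    # Concatenate the per-sensor descriptors as in Eq. (3): [IMU_1 ... IMU_4].
    return np.concatenate([imu_descriptor(s) for s in imus])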

For the EMGs, we track the activity of 4 body muscles. We obtain the signal \(E_{i}\) from each muscle at a sampling frequency of 2 kHz, where i indexes the EMG sensor. \(E_{i}\) is segmented into windows \(V_{j}\) of 200 samples, where j indexes the window. The windows \(V_{j}\) are concatenated to form a vector \(W_i\), which is characterized by a Daubechies wavelet transform with 35 orthogonal coefficients and 6 decomposition levels, producing the feature vector \(\text{ EMG }_{1\times 1300}\).
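A hedged Python sketch of this characterization is given below, using PyWavelets. The Daubechies order ("db5" here) and the rule for keeping 35 coefficients are assumptions on our part; the paper does not state how the final \(1\times 1300\) vector is assembled.

import numpy as np
import pywt  # PyWavelets

def emg_features(e_i, wavelet="db5", level=6, n_keep=35):
    # Segment E_i into 200-sample windows V_j and concatenate them into W
    # (the signal is trimmed so its length is a multiple of 200).
    n = (len(e_i) // 200) * 200
    w = e_i[:n].reshape(-1, 200).ravel()
    # 6-level Daubechies decomposition of W.
    coeffs = pywt.wavedec(w, wavelet, level=level)
    # Keep the first n_keep coefficients of each sub-band (assumption).
    return np.concatenate([c[:n_keep] for c in coeffs])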

Motion Classification. We use three multi-class support vector machines, one each for the Kinect®, IMU, and EMG features, with a one-vs-all classification strategy and Gaussian kernels to separate the data.
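A minimal sketch of one such classifier using scikit-learn is shown below; the hyperparameter values are placeholders, since the paper does not report C or the kernel width.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def make_primitive_classifier():
    # One-vs-all multi-class SVM with a Gaussian (RBF) kernel; C and gamma
    # are placeholder values, not reported in the paper.
    return OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))

# One classifier per modality, trained on the corresponding features:
# svm_kit = make_primitive_classifier().fit(X_kit, y)    # KIT features
# svm_imu = make_primitive_classifier().fit(X_imuf, y)   # IMUF features
# svm_emg = make_primitive_classifier().fit(X_emg, y)    # EMG features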

2.2 Activity Recognition

We use the set of primitive motions described in Sect. 2.1 to classify the following activities: (1) Stand still, (2) Squat and stand up, (3) Jump, (4) Raise right hand, and (5) Jog. To classify each activity, the outputs of the three SVMs are used as input to an HMM. An HMM is chosen because it has been successfully used to detect and encode sequences over time, such as those produced by the SVMs. Deep learning methods could also be explored in future work.

Hidden Markov Model Classification (HMM). As described in Sect. 2.1, each SVM classifier generates a label that corresponds to the information provided by the different sensors. The vectors EI correspond to the network of IMUs, EK to the Kinect® device, and EE to the EMGs. The data fusion process consists of generating a feature vector EF by concatenating the labels produced by each classifier over the observation window:

$$\begin{aligned} EF=\left[ EK_{1}\,EK_{2} \ldots EK_{30}\;EI_{1}\,EI_{2} \ldots EI_{30}\;EE_{1}\,EE_{2} \ldots EE_{30}\right] _{90\times 1}. \end{aligned}$$
(4)
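The sketch below shows how such discrete sequences could be scored, using the hmmlearn library (our choice; the paper does not name an implementation). It assumes the EF vectors have already been vector-quantized into the 32 codebook symbols of Sect. 2.3, with one HMM trained per activity and the highest log-likelihood deciding the class.

import numpy as np
from hmmlearn import hmm  # assumes hmmlearn >= 0.3 for CategoricalHMM

def train_activity_models(sequences_per_activity):
    # One discrete HMM per activity; 24 states as in Sect. 2.3. Each sequence
    # is a 1-D array of codebook symbols in [0, 32).
    models = {}
    for activity, seqs in sequences_per_activity.items():
        model = hmm.CategoricalHMM(n_components=24, n_iter=100)
        X = np.concatenate(seqs).reshape(-1, 1)   # stacked symbol sequences
        model.fit(X, lengths=[len(s) for s in seqs])
        models[activity] = model
    return models

def classify_activity(models, ef_symbols):
    # Score the quantized EF sequence under every model; the argmax wins.
    seq = np.asarray(ef_symbols).reshape(-1, 1)
    return max(models, key=lambda a: models[a].score(seq))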

2.3 Training and Validation Process

We train our multi-class SVM models using sequential minimal optimization (SMO). For HMM training, we use 24 states and a codebook of 32 centroids. We evaluate our models with a cross-validation strategy that randomly partitions the database into 70% of the data for training and 30% for evaluation and generates the confusion matrix of each classifier. This process follows a Monte Carlo analysis, where the stop criterion is defined by

$$\begin{aligned} \left\| \text {diag}\left( M_{k}\right) -\text {diag}\left( M_{k-1}\right) \right\| _{2}<th, \end{aligned}$$
(5)

where \(M_k\) is the confusion matrix at iteration k and th is the error threshold.
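A Python sketch of this validation loop is given below. Whether \(M_k\) in Eq. (5) denotes the per-iteration or the running-average confusion matrix is not stated; the sketch assumes the running average, which converges naturally, and `fit_predict` stands in for the full train/predict pipeline.

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def monte_carlo_validation(X, y, fit_predict, th=1e-3, max_iter=500):
    # Repeated random 70/30 splits; stop when the diagonal of the averaged
    # confusion matrix changes by less than `th` in L2 norm (Eq. (5)).
    M_avg = None
    for k in range(max_iter):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.70)
        M = confusion_matrix(y_te, fit_predict(X_tr, y_tr, X_te),
                             normalize="true")
        new_avg = M if M_avg is None else (k * M_avg + M) / (k + 1)
        if M_avg is not None and \
           np.linalg.norm(np.diag(new_avg) - np.diag(M_avg)) < th:
            return new_avg
        M_avg = new_avg
    return M_avg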

3 Results

We present results that validate the performance of our method as a function of the sensors used to collect the data. We first evaluate the performance of each sensor modality individually and then the different combinations of sensors. The assessment comprises two steps: primitive motion analysis and activity recognition analysis. The first step analyzes the performance of the SVM classifiers on the proposed primitive motions; the second validates the human activity classification using the HMM.

Table 1. Traces of the confusion matrices for the primitive motion classification analysis.

3.1 Primitive Motion Analysis

We use the validation approach described in Sect. 2.3 to obtain the confusion matrices of the Kinect®, IMU, and EMG sensors. The traces of the confusion matrices for primitive movement recognition (using all the sensors as well as the minimum number of sensors that guarantees reliable detection performance) are shown in Table 1. The Kinect® sensor provides the best primitive movement detection results, with an average detection rate of approximately \(85\%\), substantially higher than those of the other sensors. The set of IMU sensors shows performance comparable to the Kinect® for the first three primitive movements. While the EMG sensors alone perform relatively poorly, they still reach a precision higher than \(70\%\) for classes 1, 3, and 4.

We evaluated the performance of subsets of sensors by systematically removing the features corresponding to each sensor from our classification system. In columns 4 and 6 of Table 1, we report the results obtained from the subsets that showed the best performance. As the table indicates, removing a single IMU sensor results in a substantial accuracy reduction for class 6 and a more modest reduction for class 5, while the other activities remain mostly at the same performance level. The subset of EMG sensors, on the other hand, shows comparable performance for most classes.

3.2 Activity Recognition Analysis

Table 2 shows the traces of the confusion matrices of the HMM-based activity recognition for each sensor category. The results correspond to 181 Monte Carlo iterations per sensor. They demonstrate that the Kinect® or the IMU sensors alone provide high classification accuracy for all the activities, while the EMG sensors show high classification performance for activities 2, 3, 4, and 5. The table also reports the results of 30 Monte Carlo iterations using a single IMU sensor, which demonstrate that it is possible to recognize all the activities with a single sensor.

Table 2. Traces of the confusion matrices for the activity recognition analysis.
Table 3. Performance comparison for different combinations of sensors.

The results obtained using combined sensors are reported in Table 3, which shows the average value of the main diagonal of the confusion matrices as well as their uncertainty intervals at a confidence level of \(99\%\). As shown in the table, the Kinect®+IMU+EMG and the Kinect®+IMU combinations achieve the best overall performance, with a success rate of 100% for class 1 in both cases and comparable results for the other classes. Comparing these results with those in Table 2, we see that combining the Kinect® and EMG sensors improves the activity recognition performance by \(4.66\%\) with respect to the Kinect® sensor alone and by \(14.88\%\) with respect to the EMG sensors alone. The integration of the IMU and EMG sensors yields a similar performance improvement.

4 Conclusions

We developed an automatic method for human activity recognition based on multi-modal data fusion from a network of IMU and EMG sensors and a Kinect® sensor. Our approach uses multi-class support vector machines for primitive movement detection and then classifies the activity with an HMM from the sequences of SVM outputs over a time interval. This work studies the contribution of each sensor to the recognition task by evaluating the performance of different sensor configurations. Robust activity recognition requires all the sensors because of the failures these devices can exhibit during the process, including partial occlusions or self-occlusions of the Kinect® and connection losses in the wireless links commonly used to acquire data from the IMU and EMG sensors; multi-modal information from every sensor can mitigate errors caused by such failures. The proposed approach was tested on an annotated dataset created specifically for this work, since no publicly available database offered synchronized recordings of these three sensor modalities. We have made the dataset publicly available to facilitate comparisons and accelerate research in this area. In the future, we plan to expand the database to validate our approach on a wider set of activities.