Human motion based intent recognition using a deep dynamic neural model
Introduction
The recognition of human intent is a basic requirement for human–robot cooperation and interaction. If a robot can understand and even predict human intention, it can provide assistance in the form of humanized services more promptly. For this purpose, feature extraction methods are needed that capture sufficient characteristics for analyzing and understanding human intention.
Generally, vision-based feature extraction has proven to be an efficient approach for interaction between humans and machines [1]. Since advanced sensor systems such as the Asus Xtion [2] offer a convenient way to capture human motion, we can record a human tester's skeletal information and the position of each skeletal node in each frame. The sequential positions of the skeletal nodes serve as the initial input to our model.
Since human intention is not a momentary behavior but a continuous process, time series data are usually used for intention analysis [3], [4]. The hidden Markov model (HMM) is considered an efficient dynamic tool for modeling and classifying sequences of motions [5], [6] and can also be used for intention recognition [7]. However, the HMM considers only the transition probability between states; it cannot represent the contextual meaning of different motions and intentions. Furthermore, it is difficult to measure the transition probability between two kinds of motion that may be represented as variable time series: the same motion performed by two different people yields different time series, and hence different probabilities.
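To make the HMM baseline concrete, the following minimal sketch classifies a discrete observation sequence by comparing its forward-algorithm log-likelihood under two candidate HMMs. All parameter values here are illustrative toy numbers, not models from the paper.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm.
    pi: initial state probabilities, A: state transition matrix,
    B: emission matrix (rows: states, columns: observation symbols)."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # propagate, then weight by emission
        s = alpha.sum()
        loglik += np.log(s)
        alpha /= s                         # rescale to avoid underflow
    return loglik

# Two toy "motion classes" that differ only in transition structure.
pi = np.array([0.9, 0.1])
A_slow = np.array([[0.95, 0.05], [0.05, 0.95]])  # dwells in each state
A_fast = np.array([[0.5, 0.5], [0.5, 0.5]])      # switches states often
B = np.array([[0.8, 0.2], [0.2, 0.8]])

seq = [0, 0, 0, 0, 1, 1, 1, 1]                   # a "dwelling" sequence
scores = {"slow": forward_loglik(seq, pi, A_slow, B),
          "fast": forward_loglik(seq, pi, A_fast, B)}
print(max(scores, key=scores.get))               # prints "slow"
```

This illustrates the limitation noted above: classification rests entirely on transition and emission probabilities, with no representation of the contextual meaning of a motion within a longer intention.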
The recurrent neural network (RNN) model proposed by Husken and Stagge [8] offers another possible solution for dynamic signal prediction and classification. Because this RNN has two kinds of output neurons (prediction neurons and classification neurons), it can predict the output and classify the signal at the same time. Another RNN-based model, the multiple timescale recurrent neural network (MTRNN) developed by Yamashita and Tani [9], has proven efficient at predicting and generating dynamic signals. The MTRNN is built on the continuous timescale recurrent neural network (CTRNN) [10]. An interesting property of the MTRNN is that it can generate some untrained continuous signals based on existing knowledge [11], [12], and it has been shown to predict motion generation efficiently [13].
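The "multiple timescales" in the MTRNN come from the CTRNN leaky-integrator update, in which each unit has its own time constant tau. The sketch below shows only that discretized update with untrained random weights; the layer sizes and tau values are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_io, n_fast, n_slow = 4, 10, 4          # illustrative sizes only
n = n_io + n_fast + n_slow
tau = np.concatenate([np.full(n_io, 2.0),      # input/output units: fastest
                      np.full(n_fast, 5.0),    # fast context units
                      np.full(n_slow, 70.0)])  # slow context units
W = rng.normal(0, 0.1, (n, n))           # untrained weights, for illustration

def mtrnn_step(u, x):
    """One discretized CTRNN update with per-unit time constants:
    u <- (1 - 1/tau) * u + (W @ tanh(u) + input) / tau."""
    y = np.tanh(u)
    inp = np.zeros(n)
    inp[:n_io] = x                       # external input drives the I/O units
    return (1.0 - 1.0 / tau) * u + (W @ y + inp) / tau

u = np.zeros(n)
deltas = np.zeros(n)
for t in range(50):                      # feed a constant dummy input
    u_new = mtrnn_step(u, np.ones(n_io))
    deltas += np.abs(u_new - u)
    u = u_new

# Units with small tau change state much faster than units with large tau,
# which is what lets slow context units hold longer-range sequence structure.
print(deltas[n_io:n_io + n_fast].mean() > deltas[-n_slow:].mean())
```

The separation of timescales is the key design choice: fast units track momentary input, while slow units integrate over many steps and can represent sequence-level context.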
In our previous work, we combined the advantages of both models (the MTRNN and the RNN with two output layers) and developed a supervised MTRNN model that can be used for both motion classification and prediction. Similar to the model of Husken and Stagge, the supervised MTRNN generates prediction and classification signals simultaneously, so both can be used in a real-time process. Its motion classification performance has been shown to be efficient [14]. We wish to emphasize that this model can classify a lengthy, untrained combination of several separately trained signals; it can therefore detect an unknown combination of motion signals as long as each elemental motion signal has been trained. Once the motion labels are obtained by analyzing the data sequence frame by frame, the intention labels can be obtained by examining the series of labels between successive motions. For this purpose, a second supervised MTRNN model is needed to detect the intention based on the motion classification outputs.
An overview of our model is presented in Fig. 1. When a motion is observed, the first MTRNN layer recognizes the performed motion. The motion label output by the first layer is reused as input to the second layer, which produces the intention label at the same time. Two different sequences of motions may lead to two different intent recognition results even if some of their elemental motions are the same. Conversely, different intentions may end with the same motion; in that case, understanding the sequence of motions preceding the final motion is essential for recognizing complex human intention.
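The two-layer data flow described above can be sketched as follows. The two functions here are only stand-ins for the trained supervised MTRNN layers, and the motion names, window length, and intention rules are hypothetical examples chosen to show how a motion-label sequence, not a single motion, determines the intention.

```python
from collections import deque

def motion_layer(skeleton_frame):
    """Stand-in for the 1st supervised MTRNN layer:
    skeleton features -> motion label. Here the label is simply read from
    a hypothetical pre-labeled feature dict."""
    return skeleton_frame["motion"]

def intention_layer(motion_history):
    """Stand-in for the 2nd supervised MTRNN layer:
    recent motion-label sequence -> intention label."""
    rules = {("reach", "grasp", "pour"): "make_drink",
             ("reach", "grasp", "drink"): "have_drink"}
    return rules.get(tuple(motion_history), "unknown")

history = deque(maxlen=3)                  # sliding window of motion labels
frames = [{"motion": m} for m in ["reach", "grasp", "pour"]]
for f in frames:
    history.append(motion_layer(f))        # layer-1 output feeds layer 2
print(intention_layer(history))            # prints "make_drink"
```

Note that replacing only the final motion ("pour" vs. "drink") changes the recognized intention, while the same final motion reached through different preceding sequences could likewise be mapped to different intentions.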
In this paper, we consider eight kinds of meaningful motions and five kinds of human intentions. We evaluate both the motion classification ability and the intention recognition performance. Moreover, we demonstrate the robustness of our deep dynamic neural model by comparing it with a single-layer supervised MTRNN model for intention recognition.
Related work is introduced in Section 2. The proposed deep dynamic neural structure is introduced in Section 3. Section 4 presents the experimental results, which demonstrate that the proposed deep supervised MTRNN is able to classify different intentions as well as to distinguish between different human motions.
Encoding criteria for prediction and classification
We used the Asus Xtion to extract skeletal nodes relating to human motion and to record their x and y position sequences. The normalization method was introduced in our previous work [14]. The self-organizing map (SOM), which is commonly used as a pre-processing step for MTRNN feature extraction [11], [12], [15], is also used in our model. The input visual information is encoded as the activation pattern x_i = exp(−‖v − w_i‖²/σ) / Σ_j exp(−‖v − w_j‖²/σ), where w_i is the reference
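A minimal sketch of this kind of SOM pre-processing is given below: the input vector is mapped to a normalized activation pattern via a softmax of negative squared distances to the SOM reference vectors. The number of units, input dimension, σ value, and random reference vectors are all illustrative assumptions, not the paper's trained SOM.

```python
import numpy as np

rng = np.random.default_rng(1)
n_nodes, dim = 16, 30            # illustrative: 16 SOM units, 30-D input vector
W = rng.random((n_nodes, dim))   # stand-in for trained SOM reference vectors
sigma = 1.0                      # assumed sharpness parameter

def som_encode(v):
    """Map an input vector to a normalized activation pattern over SOM units:
    softmax of the negative squared distance to each reference vector."""
    d2 = ((W - v) ** 2).sum(axis=1)
    a = np.exp(-(d2 - d2.min()) / sigma)   # shift by min for numerical stability
    return a / a.sum()

v = rng.random(dim)
x = som_encode(v)
print(round(x.sum(), 6))         # activations are normalized: prints 1.0
```

The resulting sparse, topology-preserving activation vector is what the MTRNN consumes, which is gentler on the recurrent dynamics than feeding raw joint coordinates directly.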
Motivation
The tree structure is often used to describe the composition of intent [17], [18]. Since our work is based on human motion, we believe that different intentions are composed of different motional sequences. Our aim is therefore to determine human intent by observing human motional sequences. A key feature of a tree-like structure is that different leaves may share the same root, implying that different intentions may start with the same motion. Thus, knowledge of a human agent's current motion alone is not sufficient to determine intent; the preceding sequence of motions must also be considered.
Motion classification result
The experiments were performed on an IBM personal computer running Microsoft Windows 7 with an Intel Core(TM) i7-3770 processor (3.40 GHz clock speed) and 8 GB of memory. Since it was only possible to obtain the skeletal data of one person, we assumed that there were only two objects (one cup and one container with additive) on the desk. We also assumed that every grasping action would succeed in securing an object. The eight different kinds of motions and five kinds of intentions introduced above were used.
Discussion and conclusion
In this paper we presented a deep-structure supervised MTRNN model as a dynamic classification model for intention recognition. Our method was evaluated in experiments using the Asus Xtion. The model obtains real-time motion and intention classification simultaneously, and the efficiency and robustness of the deep supervised MTRNN were shown to be superior to those of the single supervised MTRNN model. Although our model cannot recognize human intention instantly, it can recognize intention progressively as the constituent motions are observed.
Acknowledgments
This work was supported by the Industrial Strategic Technology Development Program (10044009, Development of a self-improving bidirectional sustainable HRI technology for 95% of successful responses with understanding user’s complex emotion and transactional intent through continuous interactions) funded by the Ministry of Knowledge Economy (MKE, Korea) (70%) and was also supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry
Zhibin Yu is currently a Ph.D. candidate, School of Electrical Engineering & Computer Science, Kyungpook National University, Taegu, Korea. His research interests include brain science and engineering, machine learning, pattern recognition, neural networks, feature extraction, and human–machine interaction.
References (23)
A survey on vision-based human motion recognition, Image Vis. Comput. (2010)
Recurrent neural networks for time series classification, Neurocomputing (2003)
Approximation of dynamical systems by continuous time recurrent neural networks, Neural Netw. (1993)
Erratum to: understanding Japanese tourists' shopping preferences using the decision tree analysis method, Tour. Manage. (2011)
Assessing the potential of low-cost 3D cameras for the rapid measurement of plant woody structure, Sensors (2013)
N. Stefanov, A. Peer, M. Buss, Online intention recognition for computer-assisted teleoperation, in: IEEE International...
Towards automatic skill evaluation: detection and segmentation of robot-assisted surgical motions, Comput. Aided Surg. (2006)
C. Joslin, A. El-Sawah, Q. Chen, N. Georganas, Dynamic gesture recognition, in: IMTC 2005 — Instrumentation and...
D. Gehrig, H. Kuehne, A. Woerner, T. Schultz, HMM-based human motion recognition with optical flow data, in: 9th...
D. Aarno, D. Kragic, Layered HMM for Motion Intention Recognition, Intelligent Robots and Systems, 2006 IEEE/RSJ...
Emergence of functional hierarchy in a multiple timescale neural network model: a humanoid robot experiment, PLoS Comput. Biol.
Minho Lee received the Ph.D. degree from Korea Advanced Institute of Science and Technology in 1995, and is currently a professor of School of Electrical Engineering & Computer Science, Kyungpook National University, Taegu, Korea. His research interests include active vision systems based on human eye movements, selective attention, object perception and intelligent sensor system.