Robotics and Autonomous Systems

Volume 71, September 2015, Pages 134-149

Human motion based intent recognition using a deep dynamic neural model

https://doi.org/10.1016/j.robot.2015.01.001

Highlights

  • We developed an online deep dynamic neural model for intention classification.

  • We evaluated the importance of internal action generation for both motion classification and intention classification.

  • Our proposed model performs better than a single-layer supervised MTRNN.

  • The likelihood of each intention can be estimated with our model.

Abstract

The understanding of human intent based on human motions remains a highly relevant and challenging research topic. The sequential relationship among human motions offers one possible route to recognizing human intention. The supervised multiple timescale recurrent neural network (supervised MTRNN) model is a useful tool for motion classification. In this paper, we propose a new model that understands human intention from human motions in real time through a deep structure composed of two supervised MTRNN models, based on understanding the meaning of a series of human motions. The 1st supervised MTRNN layer classifies motion labels, while the 2nd supervised MTRNN layer in the deep dynamic neural structure identifies human intention from the results of the 1st supervised MTRNN. We also considered the action–perception cycle between the 1st and the 2nd supervised MTRNNs, in which motion label perception and internal action (motion prediction) form a cycle that improves both motion classification and intent recognition performance. A group of tasks was designed around movements involving two objects to test whether different motions and intentions could be detected with the proposed deep dynamic neural model. The experimental results showed the deep supervised MTRNN to be more robust than the single-layer supervised MTRNN model and to outperform it in detecting human intention. The action–perception cycle was found to efficiently improve both motion classification and prediction, which is important for human intent recognition.

Introduction

The recognition of human intent is a basic requirement for human–robot cooperation and interaction. If a robot is to understand and even predict human intention, it may be capable of providing assistance in the form of humanized services more promptly. For this purpose, various kinds of feature extraction methods are needed to find sufficient characteristics for analyzing and understanding human intention.

Generally, vision-based feature extraction has proven to be an efficient approach for human–machine interaction [1]. Since advanced sensor systems such as the Asus Xtion [2] offer a convenient way to capture human motion, we are able to capture a human tester's skeletal information and record the position of each skeletal node in each frame. The sequential position of each skeletal node is used as the initial input for our model.
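
To make this input representation concrete, the following sketch (in Python) shows how per-frame skeletal (x, y) positions could be assembled into one time-series matrix; the node count, frame length, and min-max normalization are assumptions made for illustration, not the exact procedure of [14]:

    import numpy as np

    NUM_NODES = 15      # skeletal joints reported by the sensor (assumed)
    NUM_FRAMES = 120    # length of one recorded motion (assumed)

    def build_motion_sequence(frames):
        # frames: list of NUM_FRAMES dicts mapping node id -> (x, y).
        seq = np.zeros((NUM_FRAMES, NUM_NODES * 2))
        for t, frame in enumerate(frames):
            for i in range(NUM_NODES):
                seq[t, 2 * i], seq[t, 2 * i + 1] = frame[i]
        # Min-max normalization per coordinate channel so sequences from
        # different subjects are comparable (the paper's own normalization
        # is described in [14]; min-max is an assumption here).
        lo, hi = seq.min(axis=0), seq.max(axis=0)
        return (seq - lo) / np.maximum(hi - lo, 1e-8)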

Since human intention is not a momentary behavior but a continuous process, time series data are usually used for intention analysis [3], [4]. The hidden Markov model (HMM) is considered an efficient dynamic tool for modeling and classifying sequences of motions [5], [6] and can also be used for intention recognition [7]. However, the HMM only considers the transition probability of each state; it cannot represent the contextual meaning of different motions and intentions. Further, it is difficult to measure the transition probability between two kinds of motion, because the same motion may be represented by widely varying time series: even two people performing the same motion produce different data.

The recurrent neural network (RNN) model proposed by Husken and Stagge [8] offers another possible approach to dynamic signal prediction and classification. As this RNN model has two kinds of output neurons (prediction neurons and classification neurons), it is able to predict the output and classify the signals at the same time. Another RNN-based model, developed by Yamashita and Tani [9], the multiple timescale recurrent neural network (MTRNN), has proved efficient at predicting and generating dynamic signals. The MTRNN model builds on the continuous timescale recurrent neural network (CTRNN) [10]. An interesting aspect of the MTRNN is that it can generate some untrained continuous signals based on existing knowledge [11], [12], and it has been shown to predict motion generation efficiently [13].
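
To illustrate the multiple-timescale idea, here is a minimal sketch of the leaky-integrator update shared by CTRNN/MTRNN-style models; the unit counts, weights, and time constants below are placeholders, not the values used in [9] or in our experiments:

    import numpy as np

    def mtrnn_step(u, x, W, tau):
        # One leaky-integrator update for all context units.
        # u   : internal states before activation, shape (N,)
        # x   : activations of the units feeding in, shape (M,)
        # W   : connection weights, shape (N, M)
        # tau : per-unit time constants; small tau gives fast context,
        #       large tau gives slow context (the MTRNN's defining feature)
        u_next = (1.0 - 1.0 / tau) * u + (1.0 / tau) * (W @ x)
        return u_next, np.tanh(u_next)

    # Illustrative sizes: 30 fast units (tau = 2) and 10 slow units (tau = 50).
    tau = np.concatenate([np.full(30, 2.0), np.full(10, 50.0)])
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(40, 40))
    u, y = np.zeros(40), np.zeros(40)
    for _ in range(100):    # free-running rollout of the fully recurrent net
        u, y = mtrnn_step(u, y, W, tau)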

We considered the advantages of both models (the MTRNN and the RNN with two output layers) and, in our previous work, developed a supervised MTRNN model that can be used for both motion classification and prediction. Similar to the model of Husken and Stagge, both prediction and classification signals are generated simultaneously by the supervised MTRNN and can be used in a real-time process. The supervised MTRNN has been shown to perform motion classification efficiently [14]. We wish to emphasize that this model is able to classify a lengthy untrained combination of several separately trained signals; thus, it can detect an unknown combination of motion signals as long as all elemental motion signals have been trained. After obtaining the motion labels by analyzing the data sequence in each frame, we can obtain the intention labels by examining the data series across successive motions. For this purpose, we need another supervised MTRNN model that detects intention based on the motion classification outputs.

An overview of our model is presented in Fig. 1. When a motion is observed, the model in the first MTRNN layer recognizes the performed motion. The motion label, which is the output of the 1st layer, is reused as input for the 2nd layer, and the intention label is obtained at the same time. Two different combinations of motion sequences may lead to two different intent recognition results even though some of their elemental motions are the same. On the other hand, different intentions may end with the same motion. In this case, understanding the sequence of motions preceding the final motion is essential to recognize complex human intention.
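
Schematically, this chains the two supervised MTRNNs, with the 1st layer's motion label feeding the 2nd layer at every frame. A minimal sketch of that online loop follows, assuming a hypothetical step(input) -> (prediction, label) interface for each trained network:

    def recognize_intention(frames, motion_net, intention_net):
        # Online two-layer pipeline: encoded frames in, intention label out.
        # motion_net / intention_net stand in for trained supervised MTRNNs;
        # their step(input) -> (prediction, label) interface is an assumption
        # made for this illustration.
        intention_label = None
        for frame in frames:
            # 1st layer: classify the current motion and predict the next
            # input (the internal action used by the action-perception cycle).
            prediction, motion_label = motion_net.step(frame)
            # 2nd layer: the sequence of motion labels drives intention
            # classification.
            _, intention_label = intention_net.step(motion_label)
        return intention_label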

In this paper, we consider eight kinds of meaningful motions and five kinds of human intentions. We evaluate the motion classification ability as well as the intention recognition performance, and we demonstrate the robustness of our deep dynamic neural model compared with a single-layer supervised MTRNN model for recognizing human intention.

Related work is introduced in Section 2, and the proposed deep dynamic neural structure in Section 3. Section 4 presents the experimental results, which demonstrate that the proposed deep supervised MTRNN is able to classify different intentions as well as to distinguish between different human motions.

Section snippets

Encoding criteria for prediction and classification

We used the Asus Xtion to extract skeletal nodes relating to human motion and to record their x and y position sequences. The normalization method was introduced in our previous work [14]. The self-organizing map (SOM), which is commonly used as a pre-processing method for MTRNN feature extraction [11], [12], [15], is also used in our model. The input visual information is extracted using the following formula: $$y_{i,t} = \frac{\exp\left(-\lVert v_i - v_{\mathrm{teach},t}\rVert^2 / \sigma\right)}{\sum_{j \in V} \exp\left(-\lVert v_j - v_{\mathrm{teach},t}\rVert^2 / \sigma\right)}$$ where $v_i$ is the reference…
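
A direct transcription of this encoding into code may clarify the formula; the reference vectors and the value of sigma below are placeholders:

    import numpy as np

    def som_encode(v_teach, refs, sigma):
        # Soft activation y_{i,t} of each SOM reference vector for one frame.
        # v_teach : input vector v_{teach,t} at time t
        # refs    : reference vectors v_i, shape (|V|, dim)
        # sigma   : softness parameter (placeholder value below)
        d2 = np.sum((refs - v_teach) ** 2, axis=1)   # squared distances
        a = np.exp(-d2 / sigma)
        return a / a.sum()                           # activations sum to 1

    refs = np.random.rand(16, 4)                     # placeholder SOM references
    y = som_encode(np.random.rand(4), refs, sigma=0.05)
    assert abs(y.sum() - 1.0) < 1e-9                 # a proper distribution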

Motivation

The tree structure is often used to describe the composition of intent [17], [18]. Since our work is based on human motion, we believe that different intentions are composed of different motion sequences. Our aim is therefore to determine human intent by observing human motion sequences. A key feature of a tree-like structure is that different leaves may share the same root, implying that different intentions may start with the same motion. Thus, knowledge of a human agent's current motion…
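
As a toy illustration of this tree view, intentions can be stored as paths in a prefix tree over motion labels, so that sequences sharing a prefix (the same root) diverge only at later motions; the motion and intention labels below are invented for illustration and are not the paper's task set:

    # Hypothetical motion sequences mapped to intention labels; both
    # sequences share the root ("reach", "grasp_cup") and diverge later.
    INTENTIONS = {
        ("reach", "grasp_cup", "drink"): "drink water",
        ("reach", "grasp_cup", "pour_additive"): "prepare a drink",
    }

    def match_intention(observed):
        # Return every intention still consistent with the observed prefix.
        return [label for seq, label in INTENTIONS.items()
                if seq[:len(observed)] == tuple(observed)]

    print(match_intention(["reach", "grasp_cup"]))   # both remain possible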

Motion classification result

The experiment was performed on an IBM personal computer running Microsoft Windows 7 with an Intel Core(TM) i7-3770 processor (3.40 GHz clock speed) and 8 GB of memory. In our experiments, since it was only possible to obtain the skeletal data of one person, we assumed that there were only two objects (one cup and one container with additive) on the desk. We also assumed that every grasping action would succeed in securing an object. The eight different kinds of motions and five kinds of…

Discussion and conclusion

In this paper we considered a deep-structure supervised MTRNN model and proposed a dynamic classification model for intention recognition. Our method was evaluated in experiments using the Asus Xtion. Our model obtains real-time motion and intention classification simultaneously. The efficiency and robustness of the deep supervised MTRNN model were shown to be superior to those of the single supervised MTRNN model. Although our model is not able to recognize human intention instantly, it…

Acknowledgments

This work was supported by the Industrial Strategic Technology Development Program (10044009, Development of a self-improving bidirectional sustainable HRI technology for 95% of successful responses with understanding user's complex emotion and transactional intent through continuous interactions) funded by the Ministry of Knowledge Economy (MKE, Korea) (70%) and was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry…

Zhibin Yu is currently a Ph.D. candidate in the School of Electrical Engineering & Computer Science, Kyungpook National University, Taegu, Korea. His research interests include brain science and engineering, machine learning, pattern recognition, neural networks, feature extraction, and human–machine interaction.

References (23)

  • Y. Yamashita, J. Tani, Emergence of functional hierarchy in a multiple timescale neural network model: a humanoid robot experiment, PLoS Comput. Biol. (2008).

Minho Lee received the Ph.D. degree from the Korea Advanced Institute of Science and Technology in 1995, and is currently a professor in the School of Electrical Engineering & Computer Science, Kyungpook National University, Taegu, Korea. His research interests include active vision systems based on human eye movements, selective attention, object perception, and intelligent sensor systems.
